Abstract —The explosive growth in big data has recently attracted much attention to the design of efficient indexing and search methods. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although they have elegant theoretical guarantees on the search quality in certain metric spaces, the performance of randomized hashing has been shown to be insufficient in many real-world applications. As a remedy, new approaches that incorporate data-driven learning methods into the development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve, in the hash code space, the proximity of neighboring data in the original feature space. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we summarize recent hashing approaches utilizing deep learning models. Finally, we discuss future directions and trends of research in this area.

Index Terms —Learning to hash, approximate nearest 
neighbor search, unsupervised learning, semi-supervised 
learning, supervised learning, deep learning. 

I. Introduction 

The advent of the Internet has resulted in massive information overload in recent decades. Nowadays, the World Wide Web has over 366 million accessible websites, containing more than 1 trillion webpages. For instance, Twitter receives over 100 million tweets per day, and Yahoo! exchanges over 3 billion messages per day. Besides the overwhelming textual data, the photo sharing website Flickr has more than 5 billion images available, where images are still being uploaded at the rate of over 3,000 images per minute. Another rich media sharing website
YouTube receives more than 100 hours of videos uploaded 
per minute. Due to the dramatic increase in the size of 
the data, modern information technology infrastructure 
has to deal with such gigantic databases. In fact, compared to the cost of storage, searching for relevant content in massive databases turns out to be an even more challenging task. In particular, searching for rich media data,
such as audio, images, and videos, remains a major chal¬ 
lenge since there exist major gaps between available solu¬ 
tions and practical needs in both accuracy and computa¬ 
tional costs. Besides the widely used text-based commer¬ 
cial search engines such as Google and Bing, content-based 
image retrieval (CBIR) has attracted substantial attention 
in the past decade [1]. Instead of relying on textual key¬ 
words based indexing structures, CBIR requires efficiently 
indexing media content in order to directly respond to vi¬ 
sual queries. 

Searching for similar data samples in a given database 
essentially relates to the fundamental problem of nearest 
neighbor search [5] . Exhaustively comparing a query point 
$q$ with each sample in a database $X$ is infeasible because the linear time complexity $O(|X|)$ tends to be expensive in realistic large-scale settings. Besides the scalability issue, most practical large-scale applications also suffer from
the curse of dimensionality [3], since data under mod¬ 
ern analytics usually contains thousands or even tens of 
thousands of dimensions, e.g., in documents and images. 
Therefore, beyond the infeasibility of the computational 
cost for exhaustive search, the storage constraint originat¬ 
ing from loading original data into memory also becomes 
a critical bottleneck. Note that retrieving a set of Approximate Nearest Neighbors (ANNs) is often sufficient for many practical applications. Hence, a fast and effective indexing method achieving sublinear ($o(|X|)$), logarithmic ($O(\log|X|)$), or even constant ($O(1)$) query time is desired for ANN search. Tree-based indexing approaches, such as
KD tree [4] , ball tree [5] , metric tree [6] , and vantage point 
tree [7], have been popular during the past several decades. 
However, tree-based approaches require significant storage 
costs (sometimes more than the data itself). In addition, 
the performance of tree-based indexing methods dramat¬ 
ically degrades when handling high-dimensional data [8]. 
More recently, product quantization techniques have been proposed to encode high-dimensional data vectors via subspace decomposition for efficient ANN search [9] [10].

Unlike the recursive partitioning used by tree-based indexing methods, hashing methods repeatedly partition the entire dataset and derive a single hash 'bit'^1 from each partitioning. In binary partitioning based hashing, input data is mapped to a discrete code space called Hamming space, where each sample is represented by a binary code. Specifically, given $N$ $D$-dimensional vectors $X \in \mathbb{R}^{D \times N}$, the goal of hashing is to derive suitable $K$-bit binary codes $Y \in \mathbb{B}^{K \times N}$. To generate $Y$, $K$ binary hash functions $\{h_k : \mathbb{R}^D \mapsto \mathbb{B}\}_{k=1}^{K}$ are needed. Note that hashing-based ANN search techniques can lead to substantially reduced storage as they usually store only compact binary codes. For instance, 80 million tiny images (32 x 32 pixels, double type) cost around 600G bytes [15], but can be compressed into 64-bit binary codes requiring only 600M bytes! In many cases, hash codes are organized into a hash table for inverse table lookup, as shown in Figure 1. One advantage of hashing-based indexing is that hash table lookup takes only constant query time. In fact, in many cases, the alternative of finding the nearest neighbors in the code space by explicitly computing the Hamming distance to all the database items can be done very efficiently as well.

^1 Depending on the type of the hash function used, each hash may return either an integer or simply a binary bit. In this survey we primarily focus on binary hashing techniques as they are used most commonly due to their computational and storage efficiency.

Hashing methods have been intensively studied and widely used in many different fields, including computer graphics, computational geometry, telecommunication, computer vision, etc., for several decades [T^. Among these methods, the randomized scheme of Locality-Sensitive Hashing (LSH) is one of the most popular choices [TB]. A key ingredient in the LSH family of techniques is a hash function that, with high probability, returns the same bit for nearby data points in the original metric space. LSH provides interesting asymptotic theoretical properties leading to performance guarantees. However, LSH based randomized techniques suffer from several crucial drawbacks. First, to achieve the desired search precision, LSH often needs to use long hash codes, which reduces the recall. Multiple hash tables are used to alleviate this issue, but this dramatically increases the storage cost as well as the query time. Second, the theoretical guarantees of LSH only apply to certain metrics such as $\ell_p$ ($p \in (0, 2]$) and Jaccard M-. However, returning ANNs in such metric spaces may not lead to good search performance when semantic similarity is represented in a complex way instead of a simple distance or similarity metric. This discrepancy between semantic and metric spaces has been recognized in the computer vision and machine learning communities, where it is known as the semantic gap m-.

To tackle the aforementioned issues, many hashing methods have been proposed recently to leverage machine learning techniques to produce more effective hash codes m-. The goal of learning to hash is to learn data-dependent and task-specific hash functions that yield compact binary codes to achieve good search accuracy m-. In order to achieve this goal, sophisticated machine learning tools and algorithms have been adapted to the procedure of hash function design, including the boosting algorithm m, distance metric learning m, asymmetric binary embedding [^, kernel methods [21] [22], compressed sensing [23], maximum margin learning [24], sequential learning [^, clustering analysis [55], semi-supervised learning [T3], supervised learning [57] [12], graph learning [55], and so on. For instance, in the specific application of image search, the similarity (or distance) between
image pairs is usually not defined via a simple metric. Ide¬ 
ally, one would like to provide pairs of images that contain 
“similar” or “dissimilar” images. From such pairwise la¬ 
beled information, a good hashing mechanism should be 
able to generate hash codes which preserve the semantic 
consistency, i.e., semantically similar images should have 
similar codes. Both the supervised and semi-supervised 
learning paradigms have been explored using such pair¬ 
wise semantic relationships to learn semantically relevant 
hash functions [U [12 [30] [ 35 . In this paper, we will sur¬ 
vey important representative hashing approaches and also 
discuss the future research directions. 

The remainder of this article is organized as follows. In Section II, we will present necessary background information, prior randomized hashing methods, and the motivations of studying hashing. Section III gives a high-level overview of emerging learning-based hashing methods. In Section IV, we survey several popular methods that fall into the learning to hash framework. In addition, Section V describes the recent development of using neural networks to perform deep learning of hash codes. Section VI discusses advanced hashing techniques and large-scale applications of hashing. Several open issues and future directions are described in Section VII.

II. Notations and Background

In this section, we will first present the notations, as summarized in Table I. Then we will briefly introduce the
conceptual paradigm of hashing-based ANN search. Fi¬ 
nally, we will present some background information on 
hashing methods, including the introduction of two well- 
known randomized hashing techniques. 

A. Notations 

Given a sample point $x \in \mathbb{R}^D$, one can employ a set of hash functions $H = \{h_1, \cdots, h_K\}$ to compute a $K$-bit binary code $y = \{y_1, \cdots, y_K\}$ for $x$ as

$$y = \{h_1(x), \cdots, h_2(x), \cdots, h_K(x)\}, \tag{1}$$

where the $k$-th bit is computed as $y_k = h_k(x)$. The hash function performs the mapping as $h_k : \mathbb{R}^D \to \mathbb{B}$. Such a binary encoding process can also be viewed as mapping the original data point to a binary valued space, namely Hamming space:

$$H : x \to \{h_1(x), \cdots, h_K(x)\}. \tag{2}$$

Given a set of hash functions, we can map all the items in the database $X = \{x_n\}_{n=1}^{N} \in \mathbb{R}^{D \times N}$ to the corresponding binary codes as

$$Y = H(X) = \{h_1(X), h_2(X), \cdots, h_K(X)\},$$

where the hash codes of the data $X$ are $Y \in \mathbb{B}^{K \times N}$.
TABLE I

Summary of Notations

| Symbol | Definition |
|---|---|
| $N$ | number of data points |
| $D$ | dimensionality of data points |
| $K$ | number of hash bits |
| $i, j$ | indices of data points |
| $k$ | index of a hash function |
| $x_i \in \mathbb{R}^D$, $x_j \in \mathbb{R}^D$ | the $i$-th and $j$-th data point |
| $S_i$, $S_j$ | the $i$-th and $j$-th set |
| $q \in \mathbb{R}^D$ | a query point |
| $X = [x_1, \cdots, x_N] \in \mathbb{R}^{D \times N}$ | data matrix with points as columns |
| $y_i, y_j \in \{1, -1\}^K$ or $\{0, 1\}^K$ | hash codes of data points $x_i$ and $x_j$ |
| $y_{k\cdot} \in \mathbb{B}^{1 \times N}$ | the $k$-th hash bit of $N$ data points |
| $Y = [y_1, \cdots, y_N] \in \mathbb{B}^{K \times N}$ | hash codes of data $X$ |
| $\theta_{ij}$ | angle between data points $x_i$ and $x_j$ |
| $h_k : \mathbb{R}^D \to \{1, -1\}$ | the $k$-th hash function |
| $H = \{h_1, \cdots, h_K\}$ | $K$ hash functions |
| $J(S_i, S_j)$ | Jaccard similarity between sets $S_i$ and $S_j$ |
| $J(x_i, x_j)$ | Jaccard similarity between vectors $x_i$ and $x_j$ |
| $d_H(y_i, y_j)$ | Hamming distance between $y_i$ and $y_j$ |
| $d_{WH}(y_i, y_j)$ | weighted Hamming distance between $y_i$ and $y_j$ |
| $(x_i, x_j) \in \mathcal{M}$ | a pair of similar points |
| $(x_i, x_j) \in \mathcal{C}$ | a pair of dissimilar points |
| $(q, x^+, x^-)$ | a ranking triplet |
| $S_{ij} = sim(x_i, x_j)$ | similarity between data points $x_i$ and $x_j$ |
| $S \in \mathbb{R}^{N \times N}$ | similarity matrix of data $X$ |
| $P_w$ | a hyperplane with its normal vector $w$ |

After computing the binary codes, one can perform ANN search in Hamming space with significantly reduced computation. Hamming distance between two binary codes $y_i$ and $y_j$ is defined as

$$d_H(y_i, y_j) = |y_i - y_j| = \sum_{k=1}^{K} |h_k(x_i) - h_k(x_j)|, \tag{3}$$

where $y_i = [h_1(x_i), \cdots, h_k(x_i), \cdots, h_K(x_i)]$ and $y_j = [h_1(x_j), \cdots, h_k(x_j), \cdots, h_K(x_j)]$. Note that

the Hamming distance can be calculated in an efficient 
way as a bitwise logic operation. Thus, even conduct¬ 
ing exhaustive search in the Hamming space can be 
significantly faster than doing the same in the original 
space. Furthermore, through designing a certain indexing 
structure, the ANN search with hashing methods can be 
even more efficient. Below we describe the pipeline of a 
typical hashing-based ANN search system. 
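The bitwise computation of the Hamming distance mentioned above can be sketched as follows. This is a minimal illustration that assumes codes in the $\{0, 1\}$ form packed into Python integers; the helper names `pack_bits` and `hamming_distance` are hypothetical and not part of any library.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 bits into an integer (first bit is most significant)."""
    value = 0
    for b in bits:
        value = (value << 1) | int(b)
    return value


def hamming_distance(a, b):
    """Hamming distance between two integer-packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


bits_i = [0, 1, 1, 0]
bits_j = [1, 1, 1, 0]

# Bitwise form on packed codes.
fast = hamming_distance(pack_bits(bits_i), pack_bits(bits_j))

# Sum of absolute bit differences, as in Eq. (3) for {0, 1} codes.
slow = sum(abs(u - v) for u, v in zip(bits_i, bits_j))

print(fast, slow)  # 1 1
```

For codes in $\{0, 1\}$ the two computations agree; the packed form needs one XOR and one bit count per pair of codes instead of $K$ subtractions.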

B. Pipeline of Hashing-based ANN Search

There are three basic steps in ANN search using hash¬ 
ing techniques: designing hash functions, generating hash 
codes and indexing the database items, and online query¬ 
ing using hash codes. These steps are described in detail 
below. 


B.1 Designing Hash Functions

There exist a number of ways of designing hash func¬ 
tions. Randomized hashing approaches often use random 
projections or permutations. The emerging learning to 
hash framework exploits the data distribution and often 
various levels of supervised information to determine opti¬ 
mal parameters of the hash functions. The supervised in¬ 
formation includes pointwise labels, pairwise relationships, 
and ranking orders. Due to their efficiency, the most com¬ 
monly used hash functions are of the form of a generalized 
linear projection: 

$$h_k(x) = \mathrm{sgn}(f(w_k^\top x + b_k)). \tag{4}$$

Here $f(\cdot)$ is a prespecified function which can be possibly nonlinear. The parameters to be determined are $\{w_k, b_k\}_{k=1}^{K}$, representing the projection vector $w_k$ and the corresponding intercept $b_k$. During the training procedure, the data $X$, sometimes along with supervised information, is used to estimate these parameters. In addition, different choices of $f(\cdot)$ yield different properties of the hash functions, leading to a wide range of hashing approaches. For example, LSH keeps $f(\cdot)$ to be an identity function, while shift-invariant kernel-based hashing and spectral hashing choose $f(\cdot)$ to be a shifted cosine or sinusoidal function [32] [33].
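As a rough sketch of the generalized linear projection in Eq. (4), the snippet below computes $K$-bit codes for a data matrix with points as columns. The function name `hash_codes`, the random parameters, and the convention of mapping a zero projection to $+1$ are assumptions made for illustration; in a learning to hash method, $w_k$ and $b_k$ would be estimated from data rather than sampled.

```python
import numpy as np


def hash_codes(X, W, b, f=lambda z: z):
    """Compute K-bit codes in {-1, +1} for the columns of X.

    X: (D, N) data matrix with points as columns.
    W: (D, K) matrix whose columns are the projection vectors w_k.
    b: (K,) vector of intercepts b_k.
    f: elementwise function applied before the sign (identity by default).
    """
    Z = f(W.T @ X + b[:, None])       # shape (K, N)
    return np.where(Z >= 0, 1, -1)    # sgn, with 0 mapped to +1 by convention


rng = np.random.default_rng(0)
D, N, K = 8, 5, 4
X = rng.standard_normal((D, N))
W = rng.standard_normal((D, K))       # placeholder parameters, not learned
b = np.zeros(K)

Y_linear = hash_codes(X, W, b)             # f is the identity
Y_cosine = hash_codes(X, W, b, f=np.cos)   # a cosine nonlinearity, for illustration only
Y01 = (1 + Y_linear) // 2                  # the {-1, +1} to {0, 1} conversion of Eq. (5)
```

Swapping `f` changes the family of hash functions without touching the rest of the pipeline, which is the point of writing the hash function in this generalized form.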

Note that the hash functions given by Eq. (4) generate the codes as $h_k(x) \in \{-1, 1\}$. One can easily convert them into binary codes from $\{0, 1\}$ as

$$y_k = \frac{1}{2}(1 + h_k(x)). \tag{5}$$

Without loss of generality, in this survey we will use the term hash codes to refer to either the $\{0, 1\}$ or the $\{-1, 1\}$ form, which should be clear from the context.

Fig. 1. An illustration of linear projection based binary hashing, indexing, and hash table construction for fast inverse lookup.

Fig. 2. The procedure of inverse-lookup in hash table, where $q$ is the query mapped to a 4-bit hash code "0100" and the returned approximate nearest neighbors within Hamming radius 1 are $x_1, x_2, x_n$.

B.2 Indexing Using Hash Tables

With a learned hash function, one can compute the binary codes $Y$ for all the items in a database. For $K$ hash functions, the codes for the entire database cost only $NK/8$ bytes. Assuming the original data to be stored in double-precision floating-point format, the original storage costs $8ND$ bytes. Since massive datasets are often associated with thousands of dimensions, the computed hash codes reduce the storage cost by hundreds or even thousands of times.

In practice, the hash codes of the database are organized as an inverse-lookup, resulting in a hash table or a hash map. For a set of $K$ binary hash functions, one can have at most $2^K$ entries in the hash table. Each entry, called a hash bucket, is indexed by a $K$-bit hash code. In the hash table, one keeps only those buckets that contain at least one database item. Figure 1 shows an example of using binary hash functions to index the data and construct a hash table. Thus, a hash table can be seen as an inverse-lookup table, which can return all the database items corresponding to a certain code in constant time. This procedure is key to the speedup achieved by many hashing based ANN search techniques. Since most of the $2^K$ possible buckets are typically empty, creating an inverse lookup can also be a very efficient way of storing the codes if multiple database items end up with the same codes.

B.3 Online Querying with Hashing

During the querying procedure, the goal is to find the nearest database items to a given query. The query is first converted into a code using the same hash functions that mapped the database items to codes. One way to find nearest neighbors of the query is by computing the Hamming distance between the query code and all the database codes. Note that the Hamming distance can be rapidly computed using the logical xor operation between binary codes as

$$d_H(y_i, y_j) = y_i \oplus y_j. \tag{6}$$

On modern computer architectures, this is achieved efficiently by running the xor instruction followed by popcount. With the computed Hamming distance between the query and each database item, one can perform exhaustive scan to extract the approximate nearest neighbors of the query.
Although this is much faster than the exhaustive search 
in the original feature space, the time complexity is still 
linear. An alternative way of searching for the neighbors is 
by using the inverse-lookup in the hash table and returning 
the data points within a small Hamming distance r of the 
query. Specifically, given a query point $q$ and its corresponding hash code $y_q = H(q)$, the search returns all the database points $y$ whose hash codes fall within the Hamming ball of radius $r$ centered at $y_q$, i.e., $d_H(y, H(q)) \le r$. As shown in Figure 2, for a $K$-bit binary code, a total of $\sum_{l=0}^{r} \binom{K}{l}$ possible codes will be within Hamming radius of $r$. Thus one needs to search $O(K^r)$ buckets in the hash table. The union of all the items falling into the corresponding hash buckets is returned as the search result. The inverse-lookup in a hash table has constant time complexity independent of the database size $N$. In practice, a small value of $r$ ($r = 1, 2$ is commonly used) is chosen to avoid the exponential growth in the possible code combinations that need to be searched.
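The indexing and Hamming-ball lookup described above can be sketched as follows. This is a minimal illustration under the assumption that codes are stored as Python integers and the table is an ordinary dictionary; the function names and the toy codes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_table(codes):
    """Map each K-bit code (as an int) to the list of database item ids that share it."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[code].append(item_id)
    return table


def codes_within_radius(code, K, r):
    """Yield every K-bit code whose Hamming distance to `code` is at most r."""
    for l in range(r + 1):
        for positions in combinations(range(K), l):
            flipped = code
            for p in positions:
                flipped ^= 1 << p     # flip bit p
            yield flipped


def lookup(table, query_code, K, r):
    """Return the union of items in all buckets within Hamming radius r of the query."""
    result = []
    for code in codes_within_radius(query_code, K, r):
        result.extend(table.get(code, []))
    return sorted(result)


database_codes = [0b0110, 0b1110, 0b1111, 0b0110, 0b0001]
table = build_table(database_codes)
print(lookup(table, 0b0110, K=4, r=1))  # [0, 1, 3]
```

Only non-empty buckets are stored, and the number of probed codes depends on $K$ and $r$ but not on the database size, which is why $r$ is kept small in practice.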

C. Randomized Hashing Methods 

Randomized hashing, e.g. locality sensitive hash family, 
has been a popular choice due to its simplicity. In addition, it has interesting proximity preserving properties. A binary hash function $h(\cdot)$ from the LSH family is chosen such that the probability of two points having the same bit is proportional to their (normalized) similarity, i.e.,

$$P\{h(x_i) = h(x_j)\} = sim(x_i, x_j). \tag{7}$$

Here $sim(\cdot, \cdot)$ represents similarity between a pair of points in the input space, e.g., cosine similarity or Jaccard similarity [34]. In this section, we briefly review two categories
of randomized hashing methods, i.e. random projection 
based and random permutation based approaches. 
Fig. 3. An illustration of random hyperplane partitioning based hashing method.


C.1 Random Projection Based Hashing

As a representative member of the LSH family, random 
projection based hash (RPH) functions have been widely 
used in different applications. The key ingredient of RPH is 
to map nearby points in the original space to the same hash 
bucket with a high probability. This equivalently preserves 
the locality in the original space in the Hamming space. 
Typical examples of RPH functions consist of a random projection $w$ and a random shift $b$ as

$$h_k(x) = \mathrm{sgn}(w_k^\top x + b_k). \tag{8}$$

The random vector $w$ is constructed by sampling each component of $w$ randomly from a standard Gaussian distribution for cosine distance [34].

It is easy to show that the collision probability of two samples $x_i, x_j$ falling into the same hash bucket is determined by the angle $\theta_{ij}$ between these two sample vectors, as shown in Figure 3. One can show that

$$\Pr[h_k(x_i) = h_k(x_j)] = 1 - \frac{\theta_{ij}}{\pi} = 1 - \frac{1}{\pi}\cos^{-1}\frac{x_i^\top x_j}{\|x_i\|\|x_j\|}. \tag{9}$$
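The collision probability in Eq. (9) can be checked empirically with a small simulation. The sketch below assumes hyperplanes through the origin (zero shift $b$) with standard Gaussian normal vectors; the function name `collision_rate` and the two test vectors are chosen purely for illustration.

```python
import numpy as np


def collision_rate(x_i, x_j, num_hashes, rng):
    """Fraction of random hyperplanes (through the origin) giving both points the same bit."""
    W = rng.standard_normal((num_hashes, x_i.shape[0]))
    return np.mean(np.sign(W @ x_i) == np.sign(W @ x_j))


rng = np.random.default_rng(0)
x_i = np.array([1.0, 0.0])
x_j = np.array([1.0, 1.0])   # 45 degrees away from x_i

theta = np.arccos(x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j)))
predicted = 1.0 - theta / np.pi              # Eq. (9): 0.75 for a 45-degree angle
empirical = collision_rate(x_i, x_j, 100_000, rng)

print(predicted, empirical)
```

With enough random hyperplanes the empirical rate should approach the predicted value, since a single bit collides exactly when the hyperplane does not separate the two points.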

The above collision probability gives the asymptotic theo¬ 
retical guarantees for approximating the cosine similarity 
defined in the original space. However, long hash codes are 
required to achieve sufficient discrimination for high pre¬ 
cision. This significantly reduces the recall if hash table 
based inverse lookup is used for search. In order to bal¬ 
ance the tradeoff of precision and recall, one has to con¬ 
struct multiple hash tables with long hash codes, which in¬ 
creases both storage and computation costs. In particular, 
with hash codes of length $K$, it is required to construct a sufficient number of hash tables to ensure the desired performance bound [35]. Given $l$ $K$-bit tables, the collision probability is given as:

$$P\{H(x_i) = H(x_j)\} \propto l \cdot \left[1 - \frac{1}{\pi}\cos^{-1}\frac{x_i^\top x_j}{\|x_i\|\|x_j\|}\right]^K. \tag{10}$$


To balance the search precision and recall, the length of 
hash codes should be long enough to reduce false collisions 
(i.e., non-neighbor samples falling into the same bucket). 


Meanwhile, the number of hash tables $l$ should be sufficiently large to increase the recall. However, this is inefficient due to extra storage cost and longer query time.

To overcome these drawbacks, many practical systems adopt various strategies to reduce the storage overhead and to improve the efficiency. For instance, a self-tuning indexing technique called LSH forest was proposed in [35], which aims at improving the performance without additional storage and query overhead. In [37] [38], a technique
called MultiProbe LSH was developed to reduce the num¬ 
ber of required hash tables through intelligently probing 
multiple buckets in each hash table. In [39], nonlinear randomized Hadamard transforms were explored to speed up the LSH based ANN search for Euclidean distance.
In [40], BayesLSH was proposed to combine Bayesian in¬ 
ference with LSH in a principled manner, which has prob¬ 
abilistic guarantees on the quality of the search results in 
terms of accuracy and recall. However, the random projec¬ 
tions based hash functions ignore the specific properties of 
a given data set and thus the generated binary codes are 
data-independent, which leads to less effective performance 
compared to the learning based methods to be discussed 
later. 

In machine learning and data mining community, re¬ 
cent methods tend to leverage data-dependent and task- 
specific information to improve the efficiency of random 
projection based hash functions m- For example, in¬ 
corporating kernel learning with LSH can help generalize 
ANN search from a standard metric space to a wide range 
of similarity functions [41] [42]. Furthermore, metric learn¬ 
ing has been combined with randomized LSH functions to 
explore a set of pairwise similarity and dissimilarity con¬ 
straints [19]. Other variants of locality sensitive hashing 
techniques include super-bit LSH [43], boosted LSH [18], 
as well as non-metric LSH [44].

C.2 Random Permutation based Hashing 

Another well-known paradigm from the LSH family is 
min-wise independent permutation hashing (MinHash), 
which has been widely used for approximating Jaccard 
similarity between sets or vectors. Jaccard is a popular choice for measuring similarity between documents or images. A typical application is to index documents and then identify near-duplicate samples from a corpus of documents [IS] [35]. The Jaccard similarity between two sets $S_i$ and $S_j$ is defined as $J(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$. A collection of sets $\{S_i\}_{i=1}^{N}$ can be represented as a characteristic matrix $C \in \mathbb{B}^{M \times N}$, where $M$ is the cardinality of the universal set $S_1 \cup \cdots \cup S_N$. Here the rows of $C$ represent the elements of the universal set and the columns correspond to the sets. The element $c_{di} = 1$ indicates that the $d$-th element is a member of the $i$-th set, and $c_{di} = 0$ otherwise. Assume a random permutation $\pi_k(\cdot)$ that assigns the index of the $d$-th element as $\pi_k(d) \in \{1, \cdots, D\}$. It is easy to see that the random permutation satisfies two properties for $d \neq l$: $\pi_k(d) \neq \pi_k(l)$ and $\Pr[\pi_k(d) > \pi_k(l)] = 0.5$. A random permutation based min-hash signature of a set $S_i$ is defined as the minimum index of the non-zero element after performing permutation using $\pi_k$:

$$h_k(S_i) = \min_{d \,:\, c_{di} = 1} \pi_k(d). \tag{11}$$

Note that such a hash function holds a property that the 
chance of two sets having the same MinHash values is equal 
to the Jaccard similarity between them m 

$$\Pr[h_k(S_i) = h_k(S_j)] = J(S_i, S_j). \tag{12}$$

The definition of the Jaccard similarity can be extended to two vectors $x_i = \{x_{i1}, \cdots, x_{id}, \cdots, x_{iD}\}$ and $x_j = \{x_{j1}, \cdots, x_{jd}, \cdots, x_{jD}\}$ as

$$J(x_i, x_j) = \frac{\sum_{d=1}^{D} \min(x_{id}, x_{jd})}{\sum_{d=1}^{D} \max(x_{id}, x_{jd})}.$$

Similar min-hash functions can be defined for the above vectors and the property of the collision probability shown in Eq. (12) still holds [15]. Compared to the random projection based LSH family, the min-hash functions generate
non-binary hash values that can be potentially extended 
to continuous cases. In practice, the min-hash scheme 
has shown powerful performance for high-dimensional and 
sparse vectors like the bag-of-word representation of docu¬ 
ments or feature histograms of images. In a large scale eval¬ 
uation conducted by Google Inc., the min-hash approach 
outperforms other competing methods for the application 
of webpage duplicate detection [49]. In addition, the min- 
hash scheme is also applied for Google news personal¬ 
ization [50] and near duplicate image detection m m- 
Some recent efforts have been made to further improve the min-hash technique, including b-bit minwise hashing [53] [54], one permutation approach [55], geometric min-Hashing [56], and a fast computing technique for image data m-
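The relation between MinHash collisions and Jaccard similarity in Eq. (12) can be illustrated with a short simulation. This is a sketch under simplifying assumptions: explicit random permutations of a small universe are used instead of the hash-based approximations common in large-scale systems, and all names and sizes below are illustrative.

```python
import random


def minhash_signature(s, permutations):
    """One MinHash value per permutation: the smallest permuted index among the set's elements."""
    return [min(perm[d] for d in s) for perm in permutations]


def jaccard(a, b):
    """Exact Jaccard similarity between two non-empty sets."""
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


rng = random.Random(0)
universe_size, num_hashes = 100, 2000
permutations = []
for _ in range(num_hashes):
    perm = list(range(universe_size))
    rng.shuffle(perm)                 # perm[d] plays the role of pi_k(d)
    permutations.append(perm)

S_i = set(range(0, 60))
S_j = set(range(30, 90))
exact = jaccard(S_i, S_j)             # 30 shared elements out of 90 in the union
approx = estimate_jaccard(minhash_signature(S_i, permutations),
                          minhash_signature(S_j, permutations))

print(exact, approx)
```

Each permutation contributes one non-binary hash value per set, and the fraction of matching values estimates the Jaccard similarity with a variance that shrinks as more permutations are used.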

III. Categories of Learning Based Hashing 
Methods 

Among the three key steps in hashing-based ANN 
search, design of improved data-dependent hash functions 
has been the focus in learning to hash paradigm. Since the 
proposal of LSH in [58] , many new hashing techniques have 
been developed. Note that most of the emerging hashing 
methods are focused on improving the search performance 
using a single hash table. The reason is that these tech¬ 
niques expect to learn compact discriminative codes such 
that searching within a small Hamming ball of the query 
or even exhaustive scan in Hamming space is both fast and 
accurate. Hence, in the following, we primarily focus on 
various techniques and algorithms for designing a single 
hash table. In particular, we provide different perspectives 
such as the learning paradigm and hash function charac¬ 
teristics to categorize the hashing approaches developed 
recently. It is worth mentioning that a few recent stud¬ 
ies have shown that exploring the power of multiple hash 
tables can sometimes generate superior performance. In order to improve precision as well as recall, Xu et al. developed multiple complementary hash tables that are sequentially learned using a boosting-style algorithm [31].
Also, in cases when the code length is not very large and 
the number of database points is large, exhaustive scan in 
Hamming space can be done much faster by using multi¬ 
table indexing as shown by Norouzi et al. [59] . 

A. Data-Dependent vs. Data-Independent 

Based on whether design of hash functions requires anal¬ 
ysis of a given dataset, there are two high-level cate¬ 
gories of hashing techniques: data-independent and data- 
dependent. As one of the most popular data-independent 
approaches, random projection has been used widely for 
designing data-independent hashing techniques such as 
LSH and SIKH mentioned earlier. LSH is arguably the 
most popular hashing method and has been applied to a variety of problem domains, including information retrieval and computer vision. In both LSH and SIKH, the projection vector $w$ and intercept $b$, as defined in Eq. (4), are randomly sampled from certain distributions.
Although these methods have strict performance guaran¬ 
tees, they are less efficient since the hash functions are 
not specifically designed for a certain dataset or search 
task. Based on the random projection scheme, there have 
been several efforts to improve the performance of the LSH 

method mmm- 

Realizing the limitation of data-independent hashing 
approaches, many recent methods use data and possibly 
some form of supervision to design more efficient hash 
functions. Based on the level of supervision, the data- 
dependent methods can be further categorized into three 
subgroups as described below. 

B. Unsupervised, Supervised, and Semi-Supervised 

Many emerging hashing techniques are designed by ex¬ 
ploiting various machine learning paradigms, ranging from 
unsupervised and supervised to semi-supervised settings. 
For instance, unsupervised hashing methods attempt to 
integrate the data properties, such as distributions and 
manifold structures to design compact hash codes with 
improved accuracy. Representative unsupervised methods 
include spectral hashing [35], graph hashing [55], manifold hashing m, iterative quantization hashing [61], kernelized locality sensitive hashing mm, isotropic hashing [35], angular quantization hashing [33], and spherical hashing [^. Among these approaches, spectral hashing
explores the data distribution and graph hashing utilizes 
the underlying manifold structure of data captured by 
a graph representation. In addition, supervised learning 
paradigms ranging from kernel learning to metric learning 
to deep learning have been exploited to learn binary codes, 
and many supervised hashing methods have been proposed 
recently [T3] [55] [53] [53] [57] [35] . Finally, semi-supervised 
learning paradigm was employed to design hash functions 
by using both labeled and unlabeled data. For instance, 
Wang et al. proposed a regularized objective to achieve
accurate yet balanced hash codes to avoid overfitting M- 
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Fig. 4 

An illustration of different levels of supervised 

INFORMATION: A) PAIRWISE LABELS; b) A TRIPLET 
sim(q,X+) > sim(q,x^); AND c) A DISTANCE BASED RANK LIST 
(X4,X1,X2,X3) TO A QUERY POINT q. 




Fig. 5 

Comparison of hash bits generated using a) PCA hashing and 
b) Spectral Hashing. 


In [55] [70], the authors proposed to exploit metric learning
and locality sensitive hashing to achieve fast similarity
based search. Since the labeled data is used for deriving the
optimal metric distance while the hash function design uses
no supervision, the proposed hashing technique can be
regarded as a semi-supervised approach.


C. Pointwise, Pairwise, Triplet-wise and Listwise

Based on the level of supervision, the supervised or
semi-supervised hashing methods can be further grouped into
several subcategories, including pointwise, pairwise,
triplet-wise, and listwise approaches. For example, a few
existing approaches utilize the instance level semantic
attributes or labels to design the hash functions.
Additionally, learning methods based on pairwise supervision
have been extensively studied, and many hashing techniques
have been proposed. As demonstrated in Figure 4(a), the pair
$(\mathbf{x}_2, \mathbf{x}_3)$ contains similar points and the
other two pairs $(\mathbf{x}_1, \mathbf{x}_2)$ and
$(\mathbf{x}_1, \mathbf{x}_3)$ contain dissimilar points. Such
relations are considered in the learning procedure to
preserve the pairwise label information in the learned
Hamming space. Since the ranking information is not fully
utilized, the performance of pairwise supervision based
methods could be sub-optimal for nearest neighbor search.
More recently, a triplet ranking that encodes the pairwise
proximity comparison among three data points has been
exploited to design hash codes [74][65][75]. As shown in
Figure 4(b), the point $\mathbf{x}^+$ is more similar to the
query point $\mathbf{q}$ than the point $\mathbf{x}^-$. Such
triplet ranking information, i.e.,
$sim(\mathbf{q}, \mathbf{x}^+) > sim(\mathbf{q}, \mathbf{x}^-)$,
is expected to be encoded in the learned binary hash codes.
Finally, the listwise information indicates the rank order
of a set of points with respect to the query point. In
Figure 4(c), for the query point $\mathbf{q}$, the rank list
$(\mathbf{x}_4, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$
shows the ranking order of their similarities to the query
point $\mathbf{q}$, where $\mathbf{x}_4$ is the nearest point
and $\mathbf{x}_3$ is the farthest one. By converting rank
lists to a triplet tensor matrix, listwise hashing is
designed to preserve the ranking in the Hamming space [76].


D. Linear vs. Nonlinear

Based on the form of the function $f(\cdot)$ used in the hash
function, hash functions can also be categorized in two
groups: linear and nonlinear. Due to their computational
efficiency, linear functions tend to be more popular, and
they include random projection based LSH methods. The
learning based methods derive optimal projections by
optimizing different types of objectives. For instance, PCA
hashing performs principal component analysis on the data to
derive large variance projections [63][77][78], as shown in
Figure 5(a). In the same vein, supervised methods have used
Linear Discriminant Analysis to design more discriminative
hash codes. Semi-supervised hashing methods estimate the
projections that have minimum empirical loss on pairwise
labels while partitioning the unlabeled data in a balanced
way [14]. Techniques that use the variance of the projections
as the underlying objective also tend to use orthogonality
constraints for computational ease. However, these
constraints lead to a significant drawback, since the
variance of most real-world data decays rapidly, with most of
the variance contained in only the top few directions. Thus,
in order to generate more bits in the code, one is forced to
use progressively lower-variance directions due to the
orthogonality constraints. The binary codes derived from
these low-variance projections tend to have significantly
lower performance. Two types of solutions, based on
relaxation of the orthogonality constraints or on
random/learned rotation of the data, have been proposed in
the literature to address these issues. Isotropic hashing is
proposed to derive projections with equal variances and is
shown to be superior to anisotropic variances based
projections [62]. Instead of performing one-shot learning,
sequential projection learning derives correlated projections
with the goal of correcting errors from previous hash
bits [25]. Finally, to reduce the computational complexity of
full projection, circulant binary embedding was recently
proposed to significantly speed up the encoding process using
circulant convolution [81].
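
The linear case can be made concrete with PCA hashing. The following is a minimal sketch, assuming that each bit is the sign of a centered projection onto one principal direction; the function names `pca_hash_fit` and `pca_hash_encode` and the synthetic data are illustrative choices and are not taken from the cited works.

```python
import numpy as np

def pca_hash_fit(X, n_bits):
    """Return the data mean and the top `n_bits` principal directions of X (N x D)."""
    mean = X.mean(axis=0)
    # Right singular vectors of the centered data are the principal
    # directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    W = vt[:n_bits].T  # D x n_bits projection matrix with orthonormal columns
    return mean, W

def pca_hash_encode(X, mean, W):
    """Binarize the linear projections; codes are stored as {0, 1}."""
    return ((X - mean) @ W >= 0).astype(np.uint8)

# Synthetic data whose variance decays quickly across dimensions (assumption).
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1, 0.1, 0.1])
X = rng.normal(size=(200, 8)) * scales
mean, W = pca_hash_fit(X, n_bits=4)
codes = pca_hash_encode(X, mean, W)
# Variance captured by each bit's projection; later bits use weaker directions.
proj_var = ((X - mean) @ W).var(axis=0)
```

Inspecting `proj_var` shows the drawback discussed above: because the directions are orthogonal, every additional bit is forced onto a direction with less variance than the previous one.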

Despite its simplicity, linear hashing often suffers from
insufficient discriminative power. Thus, nonlinear methods
have been developed to overcome such limitations. For
instance, spectral hashing first extracts the principal
projections of the data, and then partitions the projected
data by a sinusoidal function (nonlinear) with a specific
angular frequency. Essentially, it prefers to partition
projections with large spread and small spatial frequency
such that the large variance projections can be reused. As
illustrated in Figure 5(b), the first principal component can
be reused in spectral hashing to divide the data into four
parts while being encoded with only one bit. In addition,
shift-invariant kernel-based hashing chooses $f(\cdot)$ to be
a shifted cosine function and samples the projection vector
in the same way as standard LSH does [53]. Another category
of nonlinear hashing techniques employs kernel functions.
Anchor graph hashing proposed by Liu et al. [55] uses a kernel
function to measure the similarity of each point with a set
of anchors, resulting in nonlinear hashing. Kernelized LSH
uses a sparse set of data points to compute a kernel matrix
and performs random projection in the kernel space to compute
binary codes. Based on a similar representation of the kernel
metric, Kulis and Darrell propose learning hash functions by
explicitly minimizing the reconstruction error in the kernel
space and the Hamming space. Liu et al. apply the kernel
representation but optimize the hash functions by exploiting
the equivalence between optimizing the code inner products
and the Hamming distances to achieve scale invariance [55].

E. Single-Shot Learning vs. Multiple-Shot Learning 

For learning based hashing methods, one first formulates an
objective function reflecting the desired characteristics of
the hash codes. In a single-shot learning paradigm, the
optimal solution is derived by optimizing the objective
function in a single shot. In such a learning to hash
framework, the $K$ hash functions are learned simultaneously.
In contrast, the multiple-shot learning procedure considers a
global objective, but optimizes each hash function
considering the bias generated by the previous hash
functions. Such a procedure sequentially trains hash
functions one bit at a time [53]. Multiple-shot hash function
learning is often used in supervised or semi-supervised
settings since the given label information can be used to
assess the quality of the hash functions learned in previous
steps. For instance, sequential projection based hashing aims
to incorporate the bit correlations by iteratively updating
the pairwise label matrix, where higher weights are imposed
on point pairs violated by the previous hash functions [25].
In the complementary projection learning approach [84], the
authors present a sequential learning procedure to obtain a
series of hash functions that cross the sparse data region
and also generate balanced hash buckets. Column generation
hashing learns the best hash function during each iteration
and updates the weights of the hash functions accordingly.
Other interesting learning ideas include two-step learning
methods, which treat hash bit learning and hash function
learning separately [85] [86].

F. Non-Weighted vs. Weighted Hashing

Given the Hamming embedding defined earlier, traditional
hashing based indexing schemes map the original data into a
non-weighted Hamming space, where each bit contributes
equally. Given such a mapping, the Hamming distance is
calculated by counting the number of different bits. However,
it is easy to observe that different bits often behave
differently. In general, for linear projection based hashing
methods, the binary code generated from a large variance
projection tends to perform better due to its superior
discriminative power. Hence, to improve discrimination among
hash codes, techniques were designed to learn a weighted
Hamming embedding as

$$H : \mathbf{X} \to \{\alpha_1 h_1(\mathbf{x}), \cdots, \alpha_K h_K(\mathbf{x})\}. \tag{13}$$

Hence the conventional Hamming distance is replaced by a
weighted version as

$$d_{WH} = \sum_{k=1}^{K} \alpha_k \, |h_k(\mathbf{x}_i) - h_k(\mathbf{x}_j)|. \tag{14}$$

One of the representative approaches is Boosted Similarity
Sensitive Coding (BSSC). By learning the hash functions and
the corresponding weights $\{\alpha_1, \cdots, \alpha_K\}$
jointly, the objective is to lower the collision probability
of a non-neighbor pair
$(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C}$ while improving
the collision probability of a neighboring pair
$(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M}$. If one treats
each hash function as a decision stump, the straightforward
way of learning the weights is to directly apply the adaptive
boosting algorithm, as described in the BSSC work. In [55], a
boosting-style method called BoostMAP is proposed to map data
points to weighted binary vectors that can leverage both
metric and semantic similarity measures. Other weighted
hashing methods include designing specific bit-level
weighting schemes to improve the search accuracy [73]. In
addition, a recent work on designing a unified bit selection
framework can be regarded as a special case of the weighted
hashing approach, where the weights of the hash bits are
binary. Another effective hash code ranking method is
query-sensitive hashing, which explores the raw feature of
the query sample and learns query-specific weights of hash
bits to achieve accurate $\epsilon$-nearest neighbor
search [53].
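
To illustrate the weighted Hamming distance of Eq. (14), the following minimal sketch compares uniform and non-uniform bit weights. The function name `weighted_hamming`, the example codes, and the weight values are hypothetical and serve only as an illustration.

```python
import numpy as np

def weighted_hamming(code_a, code_b, weights):
    """Eq. (14): sum_k alpha_k * |h_k(a) - h_k(b)| for codes stored as {0, 1}."""
    code_a = np.asarray(code_a, dtype=float)
    code_b = np.asarray(code_b, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * np.abs(code_a - code_b)))

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])

# With unit weights the weighted distance reduces to the plain Hamming distance.
plain = weighted_hamming(a, b, np.ones(4))

# Hypothetical learned weights: the second bit is trusted most.
alpha = np.array([0.5, 2.0, 0.25, 1.0])
weighted = weighted_hamming(a, b, alpha)
```

Here the two codes differ in the second and third bits, so `plain` counts two mismatches while `weighted` sums the corresponding entries of `alpha`.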

IV. Methodology Review and Analysis 

In this section, we review several representative hashing
methods that explore various machine learning techniques to
design data-specific indexing schemes. The techniques consist
of unsupervised, semi-supervised, as well as supervised
approaches, including spectral hashing, anchor graph hashing,
angular quantization, binary reconstructive embedding based
hashing, metric learning based hashing, semi-supervised
hashing, column generation hashing, and ranking supervised
hashing. Table II summarizes the surveyed hashing techniques,
as well as their technical merits.

Note that this section mainly focuses on describing the
intuition and formulation of each method, as well as
discussing their pros and cons. The performance of each
individual method highly depends on practical settings,
including the learning parameters and the dataset itself. In
general, the nonlinear and supervised techniques tend to
generate better performance than linear and unsupervised
methods, while being more computationally
costly [13] [55] [57].
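Method | Hash Function / Objective Function | Parameters | Learning Paradigm | Supervision
-------|------------------------------------|------------|-------------------|------------
Spectral Hashing | $\mathrm{sgn}(\cos(\alpha \mathbf{w}^\top \mathbf{x}))$ | $\mathbf{w}, \alpha$ | unsupervised | NA
Anchor Graph Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x})$ | $\mathbf{w}$ | unsupervised | NA
Angular Quantization | $(\mathbf{b}, \mathbf{w}) = \arg\max \sum_i \frac{\mathbf{b}_i^\top}{\|\mathbf{b}_i\|_2} \mathbf{w}^\top \mathbf{x}_i$ | $\mathbf{b}, \mathbf{w}$ | unsupervised | NA
Binary Reconstructive Embedding | $\mathrm{sgn}(\mathbf{w}^\top K(\mathbf{x}))$ | $\mathbf{w}$ | unsupervised / supervised | pairwise distance
Metric Learning Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{G} \mathbf{x})$ | $\mathbf{G}, \mathbf{w}$ | supervised | pairwise similarity
Semi-Supervised Hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x})$ | $\mathbf{w}$ | semi-supervised | pairwise similarity
Column generation hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x} + b)$ | $\mathbf{w}$ | supervised | triplet
Listwise hashing | $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x} + b)$ | $\mathbf{w}$ | supervised | ranking list
Circulant binary embedding | $\mathrm{sgn}(\mathrm{circ}(\mathbf{r}) \cdot \mathbf{x})$ | $\mathbf{r}$ | unsupervised / supervised | pairwise similarity


TABLE II

A summary of the surveyed hashing techniques in this article.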


A. Spectral Hashing 

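In the formulation of spectral hashing, the desired
properties include keeping neighbors in the input space as
neighbors in the Hamming space and requiring the codes to be
balanced and uncorrelated [32]. Hence, the objective of
spectral hashing is formulated as:

$$\min_{\mathbf{Y}} \; \frac{1}{2}\,\mathrm{tr}(\mathbf{Y}^\top \mathbf{L} \mathbf{Y}) \tag{15}$$

$$\text{subject to: } \mathbf{Y} \in \{-1, 1\}^{N \times K}, \quad \mathbf{1}^\top \mathbf{y}_k = 0, \; k = 1, \cdots, K, \quad \mathbf{Y}^\top \mathbf{Y} = N \mathbf{I}_{K \times K},$$

where $\mathbf{A} = \{A_{ij}\}_{i,j=1}^{N}$ is a pairwise
similarity matrix and the Laplacian matrix is calculated as
$\mathbf{L} = \mathrm{diag}(\mathbf{A}\mathbf{1}) - \mathbf{A}$.
The constraint $\mathbf{1}^\top \mathbf{y}_k = 0$ ensures that
the hash bit $\mathbf{y}_k$ reaches a balanced partitioning of
the data, and the constraint
$\mathbf{Y}^\top \mathbf{Y} = N \mathbf{I}_{K \times K}$ imposes
orthogonality between hash bits to minimize the redundancy.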

The direct solution for the above optimization is nontrivial
even for a single bit since it is essentially a balanced
graph partition problem, which is NP hard. The orthogonality
constraints for $K$-bit balanced partitioning make the above
problem even harder. Motivated by the well-known spectral
graph analysis, the authors suggest minimizing the cost
function with relaxed constraints. In particular, with the
assumption of a uniform data distribution, the spectral
solution can be efficiently computed using 1D-Laplacian
eigenfunctions [32]. The final solution for spectral hashing
amounts to applying a sinusoidal function with a pre-computed
angular frequency to partition data along PCA directions.
Note that the projections are computed using data but learned
in an unsupervised manner. Like most orthogonal projection
based hashing methods, spectral hashing suffers from
low-quality binary coding when using low-variance
projections. Hence, a "kernel trick" is used to alleviate the
degraded performance when using long hash codes [96].
Moreover, the assumption of a uniform data distribution
rarely holds for real-world data.
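
The PCA-plus-sinusoid procedure can be sketched as follows. This is a simplified illustration, assuming the analytical eigenfunctions $\sin(\pi/2 + \frac{k\pi}{b-a}x)$ of the 1D Laplacian on an interval $[a, b]$ as used in the spectral hashing paper [32], and assuming that every retained principal direction has a nonzero range; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def spectral_hash_fit(X, n_bits):
    """Pick the n_bits lowest-frequency sinusoidal modes along PCA directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    n_pcs = min(n_bits, vt.shape[0])
    W = vt[:n_pcs].T                       # D x n_pcs PCA projection
    V = (X - mean) @ W
    lo, hi = V.min(axis=0), V.max(axis=0)  # per-direction interval [a, b]
    # Candidate angular frequencies k * pi / (b - a) for modes k = 1..n_bits.
    modes = np.arange(1, n_bits + 1)[:, None]
    omega = modes * np.pi / (hi - lo)[None, :]
    # Smaller frequency corresponds to a smaller eigenvalue, so keep the
    # n_bits smallest frequencies over all (mode, direction) pairs.
    order = np.argsort(omega, axis=None)[:n_bits]
    mode_idx, dim_idx = np.unravel_index(order, omega.shape)
    return {"mean": mean, "W": W, "lo": lo,
            "dims": dim_idx, "omegas": omega[mode_idx, dim_idx]}

def spectral_hash_encode(X, model):
    """Threshold the selected sinusoidal eigenfunctions at zero."""
    V = (X - model["mean"]) @ model["W"]
    shifted = V[:, model["dims"]] - model["lo"][model["dims"]]
    return (np.sin(np.pi / 2 + shifted * model["omegas"]) >= 0).astype(np.uint8)

# Strongly elongated uniform data (assumption for the illustration).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2)) * np.array([8.0, 1.0])
model = spectral_hash_fit(X, n_bits=3)
codes = spectral_hash_encode(X, model)
```

On this elongated data all three selected modes lie on the first principal direction (`model["dims"]` contains only zeros), which illustrates how a large variance projection is reused for several bits.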


B. Anchor Graph Hashing 

Following a similar objective as spectral hashing, anchor
graph hashing was designed to solve the problem from a
different perspective without the assumption of a uniform
distribution [35]. Note that the critical bottleneck for
solving Eq. (15) is the cost of building a pairwise
similarity graph $\mathbf{A}$, the computation of the
associated graph Laplacian, as well as solving the
corresponding eigen-system, which has at least quadratic
complexity. The key idea is to use a small set of $M$
($M \ll N$) anchor points to approximate the graph structure
represented by the matrix $\mathbf{A}$ such that the
similarity between any pair of points can be approximated
using point-to-anchor similarities. In particular, the
truncated point-to-anchor similarity matrix
$\mathbf{Z} \in \mathbb{R}^{N \times M}$ gives the similarities
between the $N$ database points and the $M$ anchor points.
Thus, the approximated similarity matrix $\hat{\mathbf{A}}$
can be calculated as
$\hat{\mathbf{A}} = \mathbf{Z} \boldsymbol{\Lambda}^{-1} \mathbf{Z}^\top$,
where
$\boldsymbol{\Lambda} = \mathrm{diag}(\mathbf{Z}^\top \mathbf{1})$
is the degree matrix of the anchor graph $\mathbf{Z}$. Based
on such an approximation, instead of solving the eigen-system
of the matrix
$\hat{\mathbf{A}} = \mathbf{Z} \boldsymbol{\Lambda}^{-1} \mathbf{Z}^\top$,
one can alternatively solve a much smaller eigen-system with
an $M \times M$ matrix
$\boldsymbol{\Lambda}^{-1/2} \mathbf{Z}^\top \mathbf{Z} \boldsymbol{\Lambda}^{-1/2}$.
The final binary codes can be obtained by calculating the
sign function over a spectral embedding as

$$\mathbf{Y} = \mathrm{sgn}\left(\mathbf{Z} \boldsymbol{\Lambda}^{-1/2} \mathbf{V} \boldsymbol{\Sigma}^{-1/2}\right). \tag{16}$$

Here we have the matrices
$\mathbf{V} = [\mathbf{v}_1, \cdots, \mathbf{v}_k, \cdots, \mathbf{v}_K] \in \mathbb{R}^{M \times K}$
and
$\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1, \cdots, \sigma_k, \cdots, \sigma_K) \in \mathbb{R}^{K \times K}$,
where $\{\mathbf{v}_k, \sigma_k\}$ are the
eigenvector-eigenvalue pairs [35]. Figure 6 shows the two-bit
partitioning on synthetic data with nonlinear structure using
different hashing methods, including spectral hashing, exact
graph hashing, and anchor graph hashing. Note that since
spectral hashing computes the two smoothest pseudo graph
Laplacian eigenfunctions instead of performing a real
spectral embedding, it cannot handle this type of nonlinear
data structure. The exact graph hashing method first
constructs an exact neighborhood graph, e.g., a $k$NN graph,
and then performs partitioning with spectral techniques to
solve the optimization problem in Eq. (15). Anchor graph
hashing achieves a good separation (by the first bit) of the
nonlinear manifold and a balanced partitioning, and even
performs better than exact graph hashing, which loses the
property of balanced partitioning for the second bit. The
anchor graph hashing approach was recently further improved
by leveraging a discrete optimization technique to directly
solve for binary hash codes without any relaxation [98].

Fig. 6

Comparison of partitioning two-moon data by the first two
hash bits using different methods: a) the first bit using
spectral hashing; b) the first bit using exact graph hashing;
c) the first bit using anchor graph hashing; d) the second
bit using spectral hashing; e) the second bit using exact
graph hashing; f) the second bit using anchor graph hashing.
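
The reduction from an $N \times N$ to an $M \times M$ eigen-system can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text above: a Gaussian kernel with a hypothetical `bandwidth`, truncation to the `s` nearest anchors, anchors given by the caller, and removal of the trivial eigenvalue 1 before selecting the $K$ eigenvectors (as in the original anchor graph hashing paper).

```python
import numpy as np

def anchor_graph_hash(X, anchors, n_bits, s=2, bandwidth=1.0):
    """Return codes Y in {-1, 1}, the kept eigenvalues, and the matrix Z."""
    N, M = X.shape[0], anchors.shape[0]
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)  # N x M
    # Truncated similarities: keep only the s nearest anchors of each point.
    nearest = np.argsort(d2, axis=1)[:, :s]
    rows = np.arange(N)[:, None]
    Z = np.zeros((N, M))
    Z[rows, nearest] = np.exp(-d2[rows, nearest] / bandwidth)
    Z /= Z.sum(axis=1, keepdims=True)          # each row sums to one
    lam = np.maximum(Z.sum(axis=0), 1e-12)     # diagonal of diag(Z^T 1)
    lam_inv_sqrt = 1.0 / np.sqrt(lam)
    # Small M x M system: Lambda^{-1/2} Z^T Z Lambda^{-1/2}.
    small = lam_inv_sqrt[:, None] * (Z.T @ Z) * lam_inv_sqrt[None, :]
    sigma, V = np.linalg.eigh(small)
    # Sort descending, skip the trivial eigenvalue 1, keep the next n_bits.
    idx = np.argsort(sigma)[::-1][1:n_bits + 1]
    sigma, V = np.maximum(sigma[idx], 1e-12), V[:, idx]
    embedding = Z @ (lam_inv_sqrt[:, None] * V) / np.sqrt(sigma)
    return np.where(embedding >= 0, 1, -1), sigma, Z

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
anchors = X[:8]  # hypothetical choice: a few data points used as anchors
Y, sigma, Z = anchor_graph_hash(X, anchors, n_bits=3)
```

The nonzero eigenvalues of the small matrix coincide with those of the full approximated similarity matrix, which is why only an $M \times M$ eigen-decomposition is needed.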


C. Angular Quantization Based Hashing 

Since similarity is often measured by the cosine of the angle
between pairs of samples, angular quantization is proposed to
map non-negative feature vectors onto the vertex of the
binary hypercube with the smallest angle [63]. In such a
setting, the vertices of the hypercube are treated as
quantization landmarks, whose number grows exponentially with
the data dimensionality $D$. As shown in Figure 7, the nearest
binary vertex $\mathbf{b}$ in a hypercube to the data point
$\mathbf{x}$ is given by

$$\mathbf{b}^* = \arg\max_{\mathbf{b}} \frac{\mathbf{b}^\top \mathbf{x}}{\|\mathbf{b}\|_2} \tag{17}$$

$$\text{subject to: } \mathbf{b} \in \{0, 1\}^D.$$

Although it is an integer programming problem, its global
maximum can be found with a complexity of $O(D \log D)$. The
optimal binary vertices will be used as the binary hash codes
for data points as $\mathbf{y} = \mathbf{b}^*$. Based on this
angular quantization framework, a data-dependent extension is
designed to learn a rotation matrix
$\mathbf{R} \in \mathbb{R}^{D \times D}$ to align the projected
data $\mathbf{R}^\top \mathbf{x}$ to the binary vertices
without changing the similarity between point pairs. The
objective is formulated as the following:

$$(\mathbf{b}_i^*, \mathbf{R}^*) = \arg\max_{\mathbf{b}_i, \mathbf{R}} \sum_i \frac{\mathbf{b}_i^\top}{\|\mathbf{b}_i\|_2} \mathbf{R}^\top \mathbf{x}_i \tag{18}$$

$$\text{subject to: } \mathbf{b}_i \in \{0, 1\}^D, \quad \mathbf{R}^\top \mathbf{R} = \mathbf{I}_{D \times D}.$$

Note that the above formulation still generates a $D$-bit
binary code for each data point, while compact codes are
often desired in many real-world applications. To generate a
$K$-bit code, a projection matrix
$\mathbf{S} \in \mathbb{R}^{D \times K}$ with orthogonal columns
can be used to replace the rotation matrix $\mathbf{R}$ in the
above objective with additional normalization, as discussed
in [63]. Finally, the optimal binary codes and the
projection/rotation matrix are learned using an alternating
optimization scheme.

Fig. 7

Illustration of the angular quantization based hashing
method [63]. The binary code of a data point $\mathbf{x}$ is
assigned as the nearest binary vertex in the hypercube, which
is $\mathbf{b}_4 = [0\ 1\ 1]^\top$ in the illustrated
example [63].
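
One way to reach the $O(D \log D)$ complexity for Eq. (17) is to sort the entries of $\mathbf{x}$ and test only the $D$ vertices that select the $m$ largest entries. The sketch below assumes a non-negative input vector and uses hypothetical function names; a brute-force search over all vertices is included only to check the result in small dimensions.

```python
import numpy as np
from itertools import product

def nearest_binary_vertex(x):
    """Solve Eq. (17) for a non-negative vector x by sorting its entries."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)                              # indices, largest first
    # Cosine-like score of the vertex that switches on the m largest entries.
    scores = np.cumsum(x[order]) / np.sqrt(np.arange(1, x.size + 1))
    m = int(np.argmax(scores)) + 1
    b = np.zeros(x.size, dtype=int)
    b[order[:m]] = 1
    return b

def vertex_score(b, x):
    """Objective of Eq. (17): b^T x / ||b||_2."""
    b = np.asarray(b, dtype=float)
    return float(b @ x / np.linalg.norm(b))

def brute_force_vertex(x):
    """Exhaustive search over all nonzero vertices (exponential in D)."""
    candidates = [np.array(c) for c in product([0, 1], repeat=len(x)) if any(c)]
    return max(candidates, key=lambda c: vertex_score(c, x))

x = np.array([0.1, 0.9, 0.8])
b_star = nearest_binary_vertex(x)
```

For this three-dimensional example the selected vertex switches on the second and third coordinates, the same vertex $[0\ 1\ 1]^\top$ that appears in the caption of Fig. 7.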

D. Binary Reconstructive Embedding 

Instead of using data-independent random projections as in
LSH or principal components as in SH, Kulis and Darrell [57]
proposed data-dependent and bit-correlated hash functions as:

$$h_k(\mathbf{x}) = \mathrm{sgn}\left(\sum_{q=1}^{s} W_{kq}\, \kappa(\mathbf{x}_{kq}, \mathbf{x})\right). \tag{19}$$

The sample set $\{\mathbf{x}_{kq}\}, q = 1, \cdots, s$ is the
training data for learning the hash function $h_k$,
$\kappa(\cdot)$ is a kernel function, and $\mathbf{W}$ is a
weight matrix.

Based on the above formulation, a method called Binary
Reconstructive Embedding (BRE) was designed to minimize a
cost function measuring the difference between the metric
distance and the reconstructed distance in Hamming space. The
Euclidean metric $d_{\mathcal{M}}$ and the binary
reconstruction distance $d_{\mathcal{R}}$ are defined as:

$$d_{\mathcal{M}}(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2}\|\mathbf{x}_i - \mathbf{x}_j\|^2, \qquad d_{\mathcal{R}}(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{K}\sum_{k=1}^{K} \left(h_k(\mathbf{x}_i) - h_k(\mathbf{x}_j)\right)^2. \tag{20}$$
Fig. 8

Illustration of the hashing method based on metric learning.
The left shows the partitioning using the standard LSH method
and the right shows the partitioning of the metric learning
based LSH method (modified from the original figure).


The objective is to minimize the following reconstruction
error to derive the optimal $\mathbf{W}$:

$$\mathbf{W}^* = \arg\min_{\mathbf{W}} \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{N}} \left[d_{\mathcal{M}}(\mathbf{x}_i, \mathbf{x}_j) - d_{\mathcal{R}}(\mathbf{x}_i, \mathbf{x}_j)\right]^2, \tag{21}$$

where the set of sample pairs $\mathcal{N}$ is the training
data. Optimizing the above objective function is difficult
due to the non-differentiability of the $\mathrm{sgn}(\cdot)$
function. Instead, a coordinate-descent algorithm was applied
to iteratively update the hash functions to a local optimum.
This hashing method can be easily extended to a supervised
scenario by setting pairs with the same labels to have zero
distance and pairs with different labels to have a large
distance. However, since the binary reconstruction distance
$d_{\mathcal{R}}$ is bounded in $[0, 1]$ while the metric
distance has no upper bound, the minimization problem in
Eq. (21) is only meaningful when the input data is
appropriately normalized. In practice, the original data
point $\mathbf{x}$ is often mapped to a hypersphere with unit
length so that $0 \le d_{\mathcal{M}} \le 1$. This
normalization removes the scale of data points, which is
often not negligible for practical applications of nearest
neighbor search. In addition, a Hamming distance based
objective is hard to optimize due to its nonconvex and
nonsmooth properties. Hence, Liu et al. proposed to utilize
the equivalence between code inner products and Hamming
distances to design supervised and kernel-based hash
functions [22]. The objective is to make the inner products
of hash codes consistent with the given pairwise supervision.
This strategy of optimizing the hash code inner product in
KSH, rather than the Hamming distance as in BRE, leads to
major performance gains in similarity-based retrieval, as
consistently confirmed in the extensive experiments reported
in [22] and recent studies [99].
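
The equivalence between code inner products and Hamming distances can be checked numerically. For codes in $\{-1, 1\}^K$, every matching bit contributes $+1$ and every mismatching bit contributes $-1$ to the inner product, so the Hamming distance equals $(K - \mathbf{a}^\top \mathbf{b})/2$. The sketch below uses randomly drawn codes and a hypothetical helper name.

```python
import numpy as np

def hamming_from_inner(a, b):
    """Hamming distance of two {-1, 1} codes computed from their inner product."""
    a, b = np.asarray(a), np.asarray(b)
    return (a.size - int(a @ b)) // 2

rng = np.random.default_rng(0)
codes = rng.choice([-1, 1], size=(5, 16))  # five random 16-bit codes

# Direct bit counting versus the inner-product formula for every pair.
direct = np.array([[int(np.sum(u != v)) for v in codes] for u in codes])
via_inner = np.array([[hamming_from_inner(u, v) for v in codes] for u in codes])
```

Both matrices agree entry by entry, which is the property that allows an objective on Hamming distances to be replaced by one on code inner products.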

E. Metric Learning based Hashing 

The key idea of the metric learning based hashing method is
to learn a parameterized Mahalanobis metric using pairwise
label information. Such learned metrics are then applied to
the standard random projection based hash functions [19]. The
goal is to preserve the pairwise relationship in the binary
code space, where similar data pairs are more likely to
collide in the same hash bucket and dissimilar pairs are less
likely to share the same hash codes, as illustrated in
Figure 8.

Fig. 9

Illustration of one-bit partitioning of different linear
projection based hashing methods: a) unsupervised hashing;
b) supervised hashing; and c) semi-supervised hashing. The
similar point pairs are indicated with a green rectangle
shape and the dissimilar point pairs with a red triangle
shape.

The parameterized inner product is defined as

$$sim(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{M} \mathbf{x}_j,$$

where $\mathbf{M}$ is a positive-definite $d \times d$ matrix
to be learned from the labeled data. Note that this
similarity measure corresponds to the parameterized squared
Mahalanobis distance $d_{\mathbf{M}}$. Assume that
$\mathbf{M}$ can be factorized as
$\mathbf{M} = \mathbf{G}^\top \mathbf{G}$. Then the
parameterized squared Mahalanobis distance can be written as

$$d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^\top \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j) = (\mathbf{G}\mathbf{x}_i - \mathbf{G}\mathbf{x}_j)^\top (\mathbf{G}\mathbf{x}_i - \mathbf{G}\mathbf{x}_j). \tag{22}$$

Based on the above equation, the distance
$d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j)$ can be interpreted
as the squared Euclidean distance between the projected data
points $\mathbf{G}\mathbf{x}_i$ and $\mathbf{G}\mathbf{x}_j$.
Note that the matrix $\mathbf{M}$ can be learned through
various metric learning methods such as information-theoretic
metric learning [100]. To accommodate the learned distance
metric, the randomized hash function is given as

$$h_k(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}_k^\top \mathbf{G} \mathbf{x}). \tag{23}$$

It is easy to see that the above hash function generates hash
codes which preserve the parameterized similarity measure in
the Hamming space. Figure 8 demonstrates the difference
between standard random projection based LSH and the metric
learning based LSH, where it is easy to see that the learned
metric helps assign the same hash bit to the similar sample
pairs. Accordingly, the collision probability is given as

$$\Pr\left[h_k(\mathbf{x}_i) = h_k(\mathbf{x}_j)\right] = 1 - \frac{1}{\pi} \cos^{-1}\left(\frac{\mathbf{x}_i^\top \mathbf{G}^\top \mathbf{G} \mathbf{x}_j}{\|\mathbf{G}\mathbf{x}_i\| \, \|\mathbf{G}\mathbf{x}_j\|}\right). \tag{24}$$

Realizing that the pairwise constraints often become
available incrementally, Jain et al. exploit an efficient
online locality-sensitive hashing with gradually learned
distance metrics.
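
The hash function of Eq. (23) and the collision probability of Eq. (24) can be checked empirically. In the sketch below the metric $\mathbf{M}$ is a fixed, hypothetical positive-definite matrix rather than a learned one, $\mathbf{G}$ is obtained from its Cholesky factor, and the projections $\mathbf{w}_k$ are drawn from a standard Gaussian as in random projection based LSH.

```python
import numpy as np

def metric_lsh_codes(X, G, n_bits, rng):
    """Eq. (23): h_k(x) = sgn(w_k^T G x) with Gaussian random w_k; codes in {0, 1}."""
    W = rng.normal(size=(n_bits, G.shape[0]))  # one random projection per bit
    return ((X @ G.T) @ W.T >= 0).astype(np.uint8)

rng = np.random.default_rng(0)

# Hypothetical metric; in practice M would come from a metric learning method.
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
G = np.linalg.cholesky(M).T  # M = L L^T, so G = L^T satisfies M = G^T G

xi = np.array([1.0, 0.2])
xj = np.array([0.6, 0.9])
codes = metric_lsh_codes(np.stack([xi, xj]), G, n_bits=20000, rng=rng)

# Fraction of bits on which the two points collide.
empirical = float(np.mean(codes[0] == codes[1]))

# Collision probability predicted by Eq. (24).
gi, gj = G @ xi, G @ xj
cosine = gi @ gj / (np.linalg.norm(gi) * np.linalg.norm(gj))
theoretical = 1.0 - np.arccos(cosine) / np.pi
```

With a large number of bits, `empirical` approaches `theoretical`, since each bit collides independently with the probability given by Eq. (24).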
F. Semi-Supervised Hashing

Supervised hashing techniques have been shown to be superior
to unsupervised approaches since they leverage the
supervision information to design task-specific hash codes.
However, in a typical large scale setting, the human
annotation process is often costly and the labels can be
noisy and sparse, which could easily lead to overfitting.

Considering a small set of pairwise labels and a large amount
of unlabeled data, semi-supervised hashing aims at designing
hash functions with minimum empirical loss while maintaining
maximum entropy over the entire dataset. Specifically, assume
the pairwise labels are given as two types of sets
$\mathcal{M}$ and $\mathcal{C}$. A pair of data points
$(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M}$ indicates that
$\mathbf{x}_i$ and $\mathbf{x}_j$ are similar, and
$(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C}$ means that
$\mathbf{x}_i$ and $\mathbf{x}_j$ are dissimilar. Hence, the
empirical accuracy on the labeled data for a family of hash
functions $H = [h_1, \cdots, h_K]$ is given as

$$J(H) = \sum_{k} \left[ \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M}} h_k(\mathbf{x}_i) h_k(\mathbf{x}_j) - \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C}} h_k(\mathbf{x}_i) h_k(\mathbf{x}_j) \right]. \tag{25}$$

We define a matrix $\mathbf{S} \in \mathbb{R}^{l \times l}$
incorporating the pairwise labeled information from the $l$
labeled points $\mathbf{X}_l$ as:

$$S_{ij} = \begin{cases} 1 & : (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M} \\ -1 & : (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C} \\ 0 & : \text{otherwise.} \end{cases} \tag{26}$$

The above empirical accuracy can be written in a compact
matrix form after dropping the $\mathrm{sgn}(\cdot)$ function:

$$J(H) = \frac{1}{2}\mathrm{tr}\left(\mathbf{W}^\top \mathbf{X}_l \mathbf{S} \mathbf{X}_l^\top \mathbf{W}\right). \tag{27}$$

However, only considering the empirical accuracy during the
design of the hash function can lead to undesired results. As
illustrated in Figure 9(b), although such a hash bit
partitions the data with zero error over the pairwise labeled
data, it results in an imbalanced separation of the unlabeled
data, thus being less informative. Therefore, an information
theoretic regularization is suggested to maximize the entropy
of each hash bit. After relaxation, the final objective is
formed as

$$\mathbf{W}^* = \arg\max_{\mathbf{W}} \frac{1}{2}\mathrm{tr}\left(\mathbf{W}^\top \mathbf{X}_l \mathbf{S} \mathbf{X}_l^\top \mathbf{W}\right) + \frac{\rho}{2}\mathrm{tr}\left(\mathbf{W}^\top \mathbf{X} \mathbf{X}^\top \mathbf{W}\right), \tag{28}$$

where the first part represents the empirical accuracy and
the second component encourages partitioning along large
variance projections. The coefficient $\rho$ weighs the
contribution from these two components. The above objective
can be solved using various optimization strategies,
resulting in orthogonal or correlated binary codes, as
described in [14] [25]. Figure 9 illustrates the comparison of
one-bit linear partitions using different learning paradigms,
where the semi-supervised method tends to produce a balanced
yet accurate data separation. Finally, Xu et al. employ a
similar semi-supervised formulation to sequentially learn
multiple complementary hash tables to further improve the
performance.


G. Column Generation Hashing

Beyond pairwise relationships, complex supervision like
ranking triplets and ranking lists has been exploited to
learn hash functions with the property of ranking preserving.
In many real applications such as image retrieval and
recommendation systems, it is often easier to receive
relative comparisons instead of instance-wise or pairwise
labels. In general, such relative comparison information is
given in a triplet form. Formally, a set of triplets is
represented as:

$$\mathcal{E} = \{(\mathbf{q}_i, \mathbf{x}_i^+, \mathbf{x}_i^-) \mid sim(\mathbf{q}_i, \mathbf{x}_i^+) > sim(\mathbf{q}_i, \mathbf{x}_i^-)\},$$

where the function $sim(\cdot)$ could be an unknown
similarity measure. Hence, the triplet
$(\mathbf{q}_i, \mathbf{x}_i^+, \mathbf{x}_i^-)$ indicates that
the sample point $\mathbf{x}_i^+$ is more semantically similar
or closer to the query point $\mathbf{q}_i$ than the point
$\mathbf{x}_i^-$, as demonstrated in Figure 4(b).

As one of the representative methods falling into this
category, column generation hashing explores the large-margin
framework to leverage this type of proximity comparison
information to design weighted hash functions. In particular,
the relative comparison information
$sim(\mathbf{q}_i, \mathbf{x}_i^+) > sim(\mathbf{q}_i, \mathbf{x}_i^-)$
will be preserved in a weighted Hamming space as
$d_{WH}(\mathbf{q}_i, \mathbf{x}_i^+) < d_{WH}(\mathbf{q}_i, \mathbf{x}_i^-)$,
where $d_{WH}$ is the weighted Hamming distance as defined in
Eq. (14). To impose a large margin, the constraint
$d_{WH}(\mathbf{q}_i, \mathbf{x}_i^+) < d_{WH}(\mathbf{q}_i, \mathbf{x}_i^-)$
should be satisfied as well as possible. Thus, a typical
large-margin objective with $\ell_1$ norm regularization can
be formulated as

$$\arg\min_{\mathbf{w}, \boldsymbol{\xi}} \sum_{i=1}^{|\mathcal{E}|} \xi_i + C \|\mathbf{w}\|_1 \tag{29}$$

$$\text{subject to: } \mathbf{w} \succeq 0, \; \boldsymbol{\xi} \succeq 0; \quad d_{WH}(\mathbf{q}_i, \mathbf{x}_i^-) - d_{WH}(\mathbf{q}_i, \mathbf{x}_i^+) \ge 1 - \xi_i, \; \forall i,$$

where $\mathbf{w}$ is the random projections for computing the
hash codes. To solve the above optimization problem, the
authors proposed using the column generation technique to
learn the hash functions and the associated bit weights
iteratively. For each iteration, the best hash function is
generated and the weight vector is updated accordingly. In
addition, different loss functions with other regularization
terms such as $\ell_\infty$ are also suggested as
alternatives in the above formulation.
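
The objective of Eq. (29) can be evaluated for a fixed set of hash bits as follows. The sketch reads $\mathbf{w}$ as the non-negative per-bit weight vector of the weighted Hamming distance in Eq. (14), which is an interpretation rather than a statement from the text, and it takes each slack variable at its smallest feasible value. The function name, the toy codes, and the value of `C` are hypothetical; the column generation solver itself is not shown.

```python
import numpy as np

def triplet_objective(w, Hq, Hpos, Hneg, C=0.1):
    """Value of Eq. (29) for bit weights w and {0, 1} codes of the triplets.

    Hq, Hpos, Hneg have shape (num_triplets, K). Each slack is set to
    max(0, 1 - (d_WH(q, x-) - d_WH(q, x+))), its smallest feasible value.
    """
    w = np.asarray(w, dtype=float)
    d_pos = np.abs(Hq - Hpos) @ w   # weighted Hamming distance to x+
    d_neg = np.abs(Hq - Hneg) @ w   # weighted Hamming distance to x-
    slack = np.maximum(0.0, 1.0 - (d_neg - d_pos))
    return float(slack.sum() + C * np.abs(w).sum()), slack

# One toy triplet with 3-bit codes.
Hq = np.array([[1, 0, 1]])
Hpos = np.array([[1, 0, 0]])   # differs from the query in bit 3
Hneg = np.array([[0, 0, 1]])   # differs from the query in bit 1

obj_uniform, slack_uniform = triplet_objective(np.ones(3), Hq, Hpos, Hneg)
obj_weighted, slack_weighted = triplet_objective(
    np.array([2.0, 1.0, 0.5]), Hq, Hpos, Hneg)
```

With uniform weights both points are equally far from the query and the margin constraint is violated, whereas weighting the first bit more heavily separates them by more than the unit margin and drives the slack to zero.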

H. Ranking Supervised Hashing 

Different from other methods that explore the triplet
relationship [74][75][65], the ranking supervised hashing
method attempts to preserve the ranking order of a set of
database points corresponding to the query point [76]. Assume
that the training dataset $\mathcal{X} = \{\mathbf{x}_n\}$ has
$N$ points with $\mathbf{x}_n \in \mathbb{R}^D$. In addition, a
query set is given as $\mathcal{Q} = \{\mathbf{q}_m\}$, with
$\mathbf{q}_m \in \mathbb{R}^D, m = 1, \cdots, M$. For any
specific query point $\mathbf{q}_m$, we can derive a ranking
list over $\mathcal{X}$, which can be written as a vector
$\mathbf{r}(\mathbf{q}_m, \mathcal{X}) = (r_1^m, \cdots, r_n^m, \cdots, r_N^m)$.
Each element $r_n^m$ falls into the integer range $[1, N]$ and
no two elements share the same value for the exact ranking
case. If $r_i^m < r_j^m$ ($i, j = 1, \cdots, N$), it indicates
that sample $\mathbf{x}_i$ has a higher rank than
$\mathbf{x}_j$, which means $\mathbf{x}_i$ is more relevant or
similar to $\mathbf{q}_m$ than $\mathbf{x}_j$. To represent
such a discrete ranking list, a ranking triplet matrix
$\mathbf{S} \in \mathbb{R}^{N \times N}$ is defined as

$$S(\mathbf{q}_m; \mathbf{x}_i, \mathbf{x}_j) = \begin{cases} 1 & : r_i^m < r_j^m \\ -1 & : r_i^m > r_j^m \\ 0 & : r_i^m = r_j^m. \end{cases} \tag{30}$$

Fig. 10

The conceptual diagram of the Rank Supervised Hashing method.
The top left component demonstrates the procedure of deriving
the ground truth ranking list $\mathbf{r}$ using the semantic
relevance or feature similarity/distance, and then converting
it to a triplet matrix $\mathbf{S}(\mathbf{q})$ for a given
query $\mathbf{q}$. The bottom left component describes the
estimation of the relaxed ranking triplet matrix from the
binary hash codes. The right component shows the objective of
minimizing the inconsistency between the two ranking triplet
matrices.

Hence for a set of query points
$\mathcal{Q} = \{\mathbf{q}_m\}$, we can derive a triplet
tensor, i.e., a set of triplet matrices

$$\mathcal{S} = \{\mathbf{S}^{(\mathbf{q}_m)}\} \in \mathbb{R}^{M \times N \times N}.$$

In particular, the element of the triplet tensor is defined
as
$S_{mij} = \mathbf{S}^{(\mathbf{q}_m)}(i, j) = S(\mathbf{q}_m; \mathbf{x}_i, \mathbf{x}_j)$,
$m = 1, \cdots, M$, $i, j = 1, \cdots, N$. The objective is to
preserve the ranking lists in the mapped Hamming space. In
other words, if
$S(\mathbf{q}_m; \mathbf{x}_i, \mathbf{x}_j) = 1$, we tend to
ensure
$d_{\mathcal{H}}(\mathbf{q}_m, \mathbf{x}_i) < d_{\mathcal{H}}(\mathbf{q}_m, \mathbf{x}_j)$,
otherwise
$d_{\mathcal{H}}(\mathbf{q}_m, \mathbf{x}_i) > d_{\mathcal{H}}(\mathbf{q}_m, \mathbf{x}_j)$.
Assuming the hash codes take values in $\{-1, 1\}$, such a
ranking order is equivalent to the similarity measurement
using the inner products of the binary codes, i.e.,

$$d_{\mathcal{H}}(\mathbf{q}_m, \mathbf{x}_i) < d_{\mathcal{H}}(\mathbf{q}_m, \mathbf{x}_j) \iff H(\mathbf{q}_m)^\top H(\mathbf{x}_i) > H(\mathbf{q}_m)^\top H(\mathbf{x}_j).$$

Then the empirical loss function $L_H$ over the ranking list
can be represented as

$$L_H = \sum_{m} \sum_{i,j} H(\mathbf{q}_m)^\top \left[H(\mathbf{x}_i) - H(\mathbf{x}_j)\right] S_{mij}.$$

Assuming we utilize linear hash functions, the final
objective is formed as the following constrained quadratic
problem

$$\mathbf{W}^* = \arg\max_{\mathbf{W}} L_H = \arg\max_{\mathbf{W}} \mathrm{tr}(\mathbf{W}\mathbf{W}^\top \mathbf{B}) \tag{31}$$

$$\text{s.t. } \mathbf{W}^\top \mathbf{W} = \mathbf{I},$$

where the constant matrix $\mathbf{B}$ is computed as
$\mathbf{B} = \sum_m \mathbf{p}_m \mathbf{q}_m^\top$ with
$\mathbf{p}_m = \sum_{i,j} [\mathbf{x}_i - \mathbf{x}_j] S_{mij}$.
The orthogonality constraint is utilized to minimize the
redundancy between different hash bits. Intuitively, the
above formulation is to preserve the ranking list in the
Hamming space, as shown in the conceptual diagram in
Figure 10. The augmented Lagrangian multiplier method was
introduced to derive feasible solutions for the above
constrained problem, as discussed in [76].
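
The conversion from a ranking list to the triplet matrix of Eq. (30) can be sketched in a few lines. The function name is hypothetical, and the example ranks encode the rank list $(\mathbf{x}_4, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$ of Fig. 4(c), in which $\mathbf{x}_4$ is the nearest point.

```python
import numpy as np

def ranking_triplet_matrix(ranks):
    """Eq. (30): S[i, j] = 1 if r_i < r_j, -1 if r_i > r_j, and 0 if r_i = r_j."""
    r = np.asarray(ranks)
    # Entry (i, j) of the difference below is r_j - r_i.
    return np.sign(r[None, :] - r[:, None]).astype(int)

# Ranks of x1, x2, x3, x4 for the list (x4, x1, x2, x3): x4 is ranked first.
ranks = [2, 3, 4, 1]
S = ranking_triplet_matrix(ranks)
```

The resulting matrix is antisymmetric with a zero diagonal, and its last row is positive everywhere except on the diagonal because $\mathbf{x}_4$ outranks every other point.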

I. Circulant Binary Embedding 

Realizing that most of the current hashing techniques rely on linear projections, which could suffer from very high computational and storage costs for high-dimensional data, circulant binary embedding was recently developed to handle such a challenge using the circulant projection [81]. Briefly, given a vector $r = (r_0, \dots, r_{d-1})$, we can generate its corresponding circulant matrix $R = \operatorname{circ}(r)$ [?]. Therefore, the binary embedding with the circulant projection is defined as:

$$h(x) = \operatorname{sgn}(Rx) = \operatorname{sgn}(\operatorname{circ}(r) \cdot x). \tag{32}$$

Since the circulant projection $\operatorname{circ}(r)x$ is equivalent to the circular convolution $r \circledast x$, the computation of the linear projection can be realized using the fast Fourier transform as

$$\operatorname{circ}(r)x = r \circledast x = \mathcal{F}^{-1}(\mathcal{F}(r) \circ \mathcal{F}(x)). \tag{33}$$

Thus, the time complexity is reduced from $d^2$ to $d \log d$. Finally, one could randomly select the circulant vector $r$ or design specific ones using supervised learning methods.
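
The equivalence between the explicit circulant projection and the FFT route can be checked numerically. The following is a minimal sketch using only the Python standard library; it assumes that $\operatorname{circ}(r)$ has $r$ as its first column and that $d$ is a power of two (a restriction of the small radix-2 FFT used here, not of the method). All function names are illustrative.

```python
import cmath
import random


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(a):
    """Inverse FFT: unnormalized inverse transform divided by the length."""
    n = len(a)
    return [v / n for v in fft(a, inverse=True)]


def circ(r):
    """Circulant matrix with first column r, i.e. R[i][j] = r[(i - j) mod d]."""
    d = len(r)
    return [[r[(i - j) % d] for j in range(d)] for i in range(d)]


def project_direct(r, x):
    """Explicit matrix-vector product circ(r) x, costing O(d^2)."""
    R = circ(r)
    d = len(x)
    return [sum(R[i][j] * x[j] for j in range(d)) for i in range(d)]


def project_fft(r, x):
    """Same projection via F^{-1}(F(r) o F(x)), costing O(d log d)."""
    fr, fx = fft(r), fft(x)
    return [v.real for v in ifft([a * b for a, b in zip(fr, fx)])]


def sgn(v):
    return 1 if v >= 0 else -1


def cbe_code(r, x):
    """Binary embedding h(x) = sgn(circ(r) x), computed with the FFT."""
    return [sgn(v) for v in project_fft(r, x)]


def random_vector(d, rng):
    """Draw a random vector, e.g. a randomly selected circulant vector r."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]
```

Only the $d$ entries of $r$ need to be stored, instead of a full $d \times d$ projection matrix.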

V. Deep Learning for Hashing 

During the past decade (since around 2006), Deep Learning [?], also known as Deep Neural Networks, has drawn increasing attention and research efforts in a variety of artificial intelligence areas including speech recognition, computer vision, machine learning, text mining, etc. Since one main purpose of deep learning is to learn robust and powerful feature representations for complex data, it is very natural to leverage deep learning for exploring compact hash codes, which can be regarded as binary representations of data. In this section, we briefly introduce several recently proposed hashing methods that employ deep learning. In Table III, we compare eight deep learning based hashing methods in terms of four key characteristics that can be used to differentiate the approaches.

The earliest work in deep learning based hashing may be Semantic Hashing [103]. This method builds a deep generative model to discover hidden binary units (i.e., latent topic features) which can model input text data (i.e., word-count vectors). Such a deep model is built as a stack of Restricted Boltzmann Machines (RBMs) [104]. After learning a multi-layer RBM through pre-training and fine-tuning on a collection of documents, the hash code of any document is acquired by simply thresholding the output of the deepest layer. Such hash codes provided by the deep RBM were shown to preserve semantically similar relationships among input documents in the code space, in which each hash code (or hash key) is used as a memory address to locate corresponding documents. In this way, semantically similar documents are mapped to adjacent memory addresses, thereby enabling efficient search via hash table lookup. To enhance the performance of deep RBMs, a supervised version was proposed in [66], which borrows the idea of nonlinear Neighbourhood Component Analysis (NCA) embedding [105]. The supervised information stems from given neighbor/nonneighbor relationships between training examples. Then, the objective function of NCA is optimized on top of a deep RBM, making the deep RBM yield discriminative hash codes. Note that supervised deep RBMs can be applied to broad data domains other than text data. In [55], supervised deep RBMs using a Gaussian distribution to model visible units in the first layer were successfully applied to handle massive image data.
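
The idea of using a hash code as a memory address can be illustrated with a small lookup table. The sketch below is not the Semantic Hashing implementation: the threshold of 0.5 (which presumes outputs in [0, 1]), the Hamming-ball probing, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def threshold(activations, t=0.5):
    """Binarize real-valued top-layer outputs into a tuple of bits.

    The cut-off t = 0.5 is an assumed value for outputs in [0, 1].
    """
    return tuple(1 if a > t else 0 for a in activations)


def build_table(codes):
    """Map each hash code (the 'memory address') to the ids stored there."""
    table = defaultdict(list)
    for doc_id, code in codes.items():
        table[code].append(doc_id)
    return table


def lookup(table, code, radius=1):
    """Return ids from every bucket within Hamming distance `radius` of code."""
    hits = []
    for r in range(radius + 1):
        for flips in combinations(range(len(code)), r):
            probe = list(code)
            for i in flips:
                probe[i] ^= 1  # flip one bit to reach an adjacent address
            hits.extend(table.get(tuple(probe), []))
    return hits
```

Probing nearby addresses costs time that depends on the code length and radius, not on the number of stored documents, which is what makes such lookups attractive for short codes.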

A recent work named Sparse Similarity-Preserving Hashing [99] tried to address the low recall issue pertaining to relatively long hash codes, which affects most previous hashing techniques. The idea is to enforce sparsity in the hash codes to be learned from training examples with pairwise supervised information, that is, similar and dissimilar pairs of examples (also known as side information in the machine learning literature). The relaxed hash functions, which are actually nonlinear embedding functions, are learned by training a Tailored Feed-Forward Neural Network. Within this architecture, two ISTA-type networks [110] that share the same set of parameters and conduct fast approximations of sparse coding are coupled in the training phase. Since each output of the neural network is continuous albeit sparse, a hyperbolic tangent function is applied to the output, followed by a thresholding operation, leading to the final binary hash code. In [99], an extension to hashing multimodal data, e.g., web images with textual tags, was also presented.

Another work named Deep Hashing [106] developed a deep neural network to learn multiple hierarchical nonlinear transformations which map original images to compact binary hash codes and hence support large-scale image retrieval with the learned binary image representation. The deep hashing model is established under three constraints which are imposed on the top layer of the deep neural network: 1) the reconstruction error between an original real-valued image feature vector and the resulting binary code is minimized, 2) each bit of the binary codes is balanced, and 3) all bits are independent of each other. Similar constraints have been adopted in prior unsupervised hashing or binary coding methods such as Iterative Quantization (ITQ) [111]. A supervised version called Supervised Deep Hashing was also presented in [106], where a discriminative term incorporating pairwise supervised information is added to the objective function of the deep hashing model. The authors of [106] showed the superiority of the supervised deep hashing model over its unsupervised counterpart. Both of them produce hash codes by thresholding the output of the top layer in the neural network, where all activation functions are hyperbolic tangent functions.

It is worthwhile to point out that the above methods, including Sparse Similarity-Preserving Hashing, Deep Hashing and Supervised Deep Hashing, did not include a pre-training stage during the training of the deep neural networks. Instead, the hash codes are learned from scratch using a set of training data. However, the absence of pre-training may make the generated hash codes less effective. Specifically, the Sparse Similarity-Preserving Hashing method is found to be inferior to the state-of-the-art supervised hashing method Kernel-Based Supervised Hashing (KSH) [22] in terms of search accuracy on some image datasets [99]; the Deep Hashing method and its supervised version are slightly better than ITQ and its supervised version CCA+ITQ, respectively [111][106]. Note that KSH, ITQ and CCA+ITQ exploit relatively shallow learning frameworks.

Almost all existing hashing techniques, including the aforementioned ones relying on deep neural networks, take a vector of hand-crafted visual features extracted from an image as input. Therefore, the quality of the produced hash codes heavily depends on the quality of the hand-crafted features. To remove this barrier, a recent method called Convolutional Neural Network Hashing [107] was developed to integrate image feature learning and hash value learning into a joint learning model. Given pairwise supervised information, this model consists of a stage of learning approximate hash codes and a stage of training a deep Convolutional Neural Network (CNN) [112] that outputs continuous hash values. Such hash values can be generated by activation functions like sigmoid, hyperbolic tangent or softmax, and then quantized into binary hash codes through appropriate thresholding. Thanks to the power of CNNs, the joint model is capable of simultaneously learning image features and hash values, directly working on raw image pixels. The deployed CNN is composed of three convolution-pooling layers that involve rectified linear activation, max pooling, and local contrast normalization, a standard fully-connected layer, and an output layer with softmax activation functions.

TABLE III

The characteristics of eight recently proposed deep learning based hashing methods.

| Deep Learning based Hashing Methods | Data Domain | Learning Paradigm | Learning Features? | Hierarchy of Deep Neural Networks |
|---|---|---|---|---|
| Semantic Hashing [103] | text | unsupervised | no | 4 |
| Restricted Boltzmann Machine [66] | text and image | supervised | no | 4 and 5 |
| Tailored Feed-Forward Neural Network [99] | text and image | supervised | no | 6 |
| Deep Hashing [106] | image | unsupervised | no | 3 |
| Supervised Deep Hashing [106] | image | supervised | no | 3 |
| Convolutional Neural Network Hashing [107] | image | supervised | yes | 5 |
| Deep Semantic Ranking Hashing [108] | image | supervised | yes | 8 |
| Deep Neural Network Hashing [109] | image | supervised | yes | 10 |

Footnote: It is essentially semi-supervised, as abundant unlabeled examples are used for training the deep neural network.

Also based on CNNs, a more recent method called Deep Semantic Ranking Hashing [108] was presented to learn hash values such that multilevel semantic similarities among multi-labeled images are preserved. Like the Convolutional Neural Network Hashing method, this method takes image pixels as input and trains a deep CNN, by which image feature representations and hash values are jointly learned. The deployed CNN consists of five convolution-pooling layers, two fully-connected layers, and a hash layer (i.e., output layer). The key hash layer is connected to both fully-connected layers and takes the functional form

$$h(x) = 2\sigma(w^\top [f_1(x); f_2(x)]) - 1,$$

in which $x$ represents an input image, $h(x)$ represents the vector of hash values for image $x$, $f_1(x)$ and $f_2(x)$ respectively denote the feature representations from the outputs of the first and second fully-connected layers, $w$ is the weight vector, and $\sigma(\cdot)$ is the logistic function. The Deep Semantic Ranking Hashing method leverages listwise supervised information to train the CNN, which stems from a collection of image triplets that encode the multilevel similarities, i.e., the first image in each triplet is more similar to the second one than the third one. The hash code of image $x$ is finally obtained by thresholding the output $h(x)$ of the hash layer at zero.
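
The hash layer expression can be sketched in a few lines. This is an illustration only: it assumes one weight vector per hash bit (stored as the rows of a hypothetical matrix `W`), takes the two fully-connected outputs as plain lists, and maps a value of exactly zero to -1, a tie-breaking choice the text does not specify.

```python
import math


def sigmoid(t):
    """Logistic function sigma(t)."""
    return 1.0 / (1.0 + math.exp(-t))


def hash_layer(W, f1, f2):
    """Compute h(x) = 2 * sigma(w^T [f1; f2]) - 1 for each row w of W.

    f1 and f2 stand for the outputs of the first and second
    fully-connected layers; every returned value lies in (-1, 1).
    """
    f = list(f1) + list(f2)  # concatenation [f1(x); f2(x)]
    return [2.0 * sigmoid(sum(w * v for w, v in zip(row, f))) - 1.0 for row in W]


def binarize(h):
    """Threshold the hash values at zero (ties mapped to -1 by assumption)."""
    return [1 if v > 0 else -1 for v in h]
```

Since $2\sigma(t) - 1 = \tanh(t/2)$, the layer output is symmetric around zero, which is why thresholding at zero is the natural choice.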

The above Convolutional Neural Network Hashing method [107] requires separately learning approximate hash codes to guide the subsequent learning of the image representation and finer hash values. A more recent method called Deep Neural Network Hashing [109] goes further: the image representation and hash values are learned in one stage, so that representation learning and hash learning are tightly coupled to benefit each other. Similar to the Deep Semantic Ranking Hashing method [108], the Deep Neural Network Hashing method incorporates listwise supervised information to train a deep CNN, giving rise to the currently deepest architecture for supervised hashing. The pipeline of the deep hashing architecture includes three building blocks: 1) a triplet of images (the first image is more similar to the second one than the third one) which are fed to the CNN, and upon which a triplet ranking loss is designed to characterize the listwise supervised information; 2) a shared sub-network with a stack of eight convolution layers to generate the intermediate image features; 3) a divide-and-encode module to divide the intermediate image features into multiple channels, each of which is encoded into a single hash bit. Within the divide-and-encode module, there are one fully-connected layer and one hash layer. The former uses sigmoid activation, while the latter uses a piecewise thresholding scheme to produce nearly discrete hash values. Eventually, the hash code of any image is yielded by thresholding the output of the hash layer at 0.5. In [109], the Deep Neural Network Hashing method was shown to surpass the Convolutional Neural Network Hashing method as well as several shallow learning based supervised hashing methods in terms of image search accuracy.
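
The divide-and-encode step described above can be sketched as follows. This is a rough illustration under several assumptions that are not taken from the text: the features are split into equal-size slices, each slice has its own weight vector, and the piecewise thresholding snaps values to 0 or 1 outside a band of half-width `eps` around 0.5. The names `piecewise_threshold`, `divide_and_encode`, and `eps` are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function used by the fully-connected layer."""
    return 1.0 / (1.0 + math.exp(-t))


def piecewise_threshold(s, eps=0.1):
    """Assumed piecewise scheme: saturate to 0 or 1 outside [0.5 - eps, 0.5 + eps]."""
    if s < 0.5 - eps:
        return 0.0
    if s > 0.5 + eps:
        return 1.0
    return s


def divide_and_encode(features, weights, eps=0.1):
    """Split `features` into len(weights) equal slices; each slice yields one hash value."""
    q = len(weights)
    size = len(features) // q
    values = []
    for k in range(q):
        chunk = features[k * size:(k + 1) * size]
        s = sigmoid(sum(w * f for w, f in zip(weights[k], chunk)))
        values.append(piecewise_threshold(s, eps))
    return values


def to_code(values):
    """Final binarization: threshold the hash-layer output at 0.5."""
    return [1 if v > 0.5 else 0 for v in values]
```

Because each bit sees only its own slice of the features, the bits share no weights in this module; that is the design point the divide-and-encode description emphasizes.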

Last, a few observations are worth mentioning about the deep learning based hashing methods introduced in this section.

1. The majority of these methods did not report the time of hash code generation. In real-world search scenarios, hash codes must be generated quickly. There might be concern about the hashing speed of those deep neural network driven approaches, especially the approaches involving image feature learning, which may take much longer to hash an image than shallow learning driven approaches like ITQ and KSH.

2. Instead of employing deep neural networks to seek hash codes, another interesting problem is to design a proper hashing technique to accelerate deep neural network training or save memory space. The latest work [113] presented a hashing trick named HashedNets, which shrinks the storage costs of neural networks significantly while mostly preserving the generalization performance in image classification tasks.

VI. Advanced Methods and Related Applications

In this section, we further extend the survey scope to cover a few more advanced hashing methods that are developed for specific settings and applications, such as point-to-hyperplane hashing, subspace hashing, and multimodality hashing.

A. Hyperplane Hashing 

Distinct from the previously surveyed conventional hashing techniques, all of which address the problem of fast point-to-point nearest neighbor search (see Figure 12(a)), a new scenario, "point-to-hyperplane" hashing, emerges to tackle fast point-to-hyperplane nearest neighbor search (see Figure 12(b)), where the query is a hyperplane instead of a data point. Such a new scenario requires hashing the hyperplane query to near database points, which is difficult to accomplish because point-to-hyperplane distances are quite different from routine point-to-point distances in terms of the computation mechanism. Despite the bulk of research on point-to-point hashing, this special hashing paradigm is rarely touched. For convenience, we refer to point-to-hyperplane hashing as Hyperplane Hashing.

Fig. 11. Active learning framework with a hashing-based fast query selection strategy. (Illustration from http://vision.cs.utexas.edu/projects/activehash/)

Hyperplane hashing is actually fairly important for many machine learning applications such as large-scale active learning with SVMs [?]. In SVM-based active learning, the well-proven sample selection strategy is to search in the unlabeled sample pool to identify the sample closest to the current hyperplane decision boundary, thus providing the most useful information for improving the learning model. When making such active learning scalable to gigantic databases, exhaustive search for the point nearest to the hyperplane is not efficient enough for the online sample selection requirement. Hence, novel hashing methods that can handle hyperplane queries in a principled way are called for. A conceptual diagram using hyperplane hashing to scale up the active learning process is shown in Figure 11.

We demonstrate the geometric relationship between a data point $x$ and a hyperplane $\mathcal{P}_w$ with the normal vector $w$ in Figure 13(a). Given a hyperplane query $\mathcal{P}_w$ and a set of points $\mathcal{X}$, the target nearest neighbor is

$$x^* = \arg\min_{x \in \mathcal{X}} D(x, \mathcal{P}_w),$$

where $D(x, \mathcal{P}_w) = \frac{|w^\top x|}{\|w\|}$ is the point-to-hyperplane distance. The existing hyperplane hashing methods [?] all attempt to minimize a slightly modified "distance" $\frac{|w^\top x|}{\|w\|\,\|x\|}$, i.e., the sine of the point-to-hyperplane angle $\alpha_{x,w} = \left|\theta_{x,w} - \frac{\pi}{2}\right|$. Note that $\theta_{x,w} \in [0, \pi]$ is the angle between $x$ and $w$. The angle measure $\alpha_{x,w} \in [0, \pi/2]$ between a database point and a hyperplane query is reflected in the design of the hash functions.
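
A small numeric example may help relate the distance and the angle. The sketch below assumes a hyperplane through the origin with normal $w$, so that the distance is $|w^\top x| / \|w\|$; the function names are illustrative.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def point_to_hyperplane_distance(x, w):
    """D(x, P_w) = |w^T x| / ||w|| for a hyperplane through the origin."""
    return abs(dot(w, x)) / norm(w)


def point_to_hyperplane_angle(x, w):
    """alpha = |theta - pi/2|, where theta in [0, pi] is the angle between x and w."""
    c = dot(w, x) / (norm(w) * norm(x))
    theta = math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error
    return abs(theta - math.pi / 2)
```

For $w = (0, 0, 1)$ and $x = (3, 0, 4)$, the distance is 4 and $\sin(\alpha_{x,w}) = 4/5$, i.e., the distance divided by $\|x\| = 5$, matching the modified "distance" above.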

Fig. 12. Two distinct nearest neighbor search problems. (a) Point-to-point search: the blue solid circle represents a point query and the red circle represents the found nearest neighbor point. (b) Point-to-hyperplane search: the blue plane denotes a hyperplane query $\mathcal{P}_w$ with $w$ being its normal vector, and the red circle denotes the found nearest neighbor point.

As shown in Figure 13(b), the goal of hyperplane hashing is to hash a hyperplane query $\mathcal{P}_w$ and the desired neighbors (e.g., $x_1, x_2$) with narrow $\alpha_{x,w}$ into the same or nearby hash buckets, while avoiding returning the undesired nonneighbors (e.g., $x_3, x_4$) with wide $\alpha_{x,w}$. Because $\alpha_{x,w} = \left|\theta_{x,w} - \frac{\pi}{2}\right|$, the point-to-hyperplane search problem can be equivalently transformed to a specific point-to-point search problem where the query is the hyperplane normal $w$ and the desired nearest neighbor to the raw query $\mathcal{P}_w$ is the one whose angle $\theta_{x,w}$ from $w$ is closest to $\pi/2$, i.e., most closely perpendicular to $w$ (we write "perpendicular to $w$" as $\perp w$ for brevity). This is very different from traditional point-to-point nearest neighbor search, which returns the most similar point to the query point. In the following, several existing hyperplane hashing methods will be briefly discussed.

Jain et al. [?] devised two different families of randomized hash functions to attack the hyperplane hashing problem. The first one is Angle-Hyperplane Hash (AH-Hash) $\mathcal{A}$, of which one instance function is

$$h^A(z) = \begin{cases} [\operatorname{sgn}(u^\top z), \operatorname{sgn}(v^\top z)], & z \text{ is a database point} \\ [\operatorname{sgn}(u^\top z), \operatorname{sgn}(-v^\top z)], & z \text{ is a hyperplane normal} \end{cases} \tag{34}$$

where $z \in \mathbb{R}^d$ represents an input vector, and $u$ and $v$ are both drawn independently from a standard $d$-variate Gaussian, i.e., $u, v \sim \mathcal{N}(0, I_{d \times d})$. Note that $h^A$ is a two-bit hash function which leads to the probability of collision for a hyperplane normal $w$ and a database point $x$:

$$\Pr[h^A(w) = h^A(x)] = \frac{1}{4} - \frac{\alpha_{x,w}^2}{\pi^2}. \tag{35}$$

This probability monotonically decreases as the point-to-hyperplane angle $\alpha_{x,w}$ increases, ensuring angle-sensitive hashing.
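
The AH-Hash collision probability can be checked empirically. The following sketch draws fresh Gaussian vectors $u, v$ per trial and compares the observed collision rate with $\frac{1}{4} - \frac{\alpha^2}{\pi^2}$; the function names and the trial count are illustrative choices.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sgn(t):
    return 1 if t >= 0 else -1


def ah_hash(z, u, v, is_query):
    """Two-bit AH-Hash; is_query=True means z is a hyperplane normal."""
    second = -dot(v, z) if is_query else dot(v, z)
    return (sgn(dot(u, z)), sgn(second))


def ah_collision_rate(x, w, trials, rng):
    """Fraction of random (u, v) draws under which w and x get the same two bits."""
    d = len(x)
    hits = 0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        hits += ah_hash(w, u, v, True) == ah_hash(x, u, v, False)
    return hits / trials


def ah_collision_prob(alpha):
    """Closed form 1/4 - alpha^2 / pi^2."""
    return 0.25 - alpha ** 2 / math.pi ** 2
```

The closed form follows because the first bit matches with probability $1 - \theta_{x,w}/\pi$ and the second (sign-flipped) bit with probability $\theta_{x,w}/\pi$, and the two are independent.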

Fig. 13. The hyperplane hashing problem. (a) Point-to-hyperplane distance $D(x, \mathcal{P}_w)$ and point-to-hyperplane angle $\alpha_{x,w}$. (b) Neighbors $(x_1, x_2)$ and nonneighbors $(x_3, x_4)$ of the hyperplane query $\mathcal{P}_w$; the ideal neighbors are the points $\perp w$.

The second family proposed by Jain et al. is the Embedding-Hyperplane Hash (EH-Hash) function family $\mathcal{E}$, of which one instance is

$$h^E(z) = \begin{cases} \operatorname{sgn}(U^\top V(zz^\top)), & z \text{ is a database point} \\ \operatorname{sgn}(-U^\top V(zz^\top)), & z \text{ is a hyperplane normal} \end{cases} \tag{36}$$

where $V(A)$ returns the vectorial concatenation of matrix $A$, and $U \sim \mathcal{N}(0, I_{d^2 \times d^2})$. The EH hash function $h^E$ yields hash bits on an embedded space $\mathbb{R}^{d^2}$ resulting from vectorizing the rank-one matrices $zz^\top$ and $-zz^\top$. Compared with $h^A$, $h^E$ gives a higher probability of collision:

$$\Pr[h^E(w) = h^E(x)] = \frac{\cos^{-1}(\sin^2(\alpha_{x,w}))}{\pi}, \tag{37}$$

which also bears the angle-sensitive hashing property. However, it is much more expensive to compute than AH-Hash.
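
A minimal sketch of EH-Hash makes both the embedding and its cost visible: each bit needs a Gaussian vector of length $d^2$. The code below uses a row-major flattening for $V(\cdot)$ (any fixed ordering would do, since $U$ is isotropic) and compares the empirical collision rate with $\cos^{-1}(\sin^2 \alpha)/\pi$; names and trial counts are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sgn(t):
    return 1 if t >= 0 else -1


def vec_outer(z):
    """V(z z^T): the d x d rank-one matrix flattened row by row into length d^2."""
    return [a * b for a in z for b in z]


def eh_hash(z, U, is_query):
    """One-bit EH-Hash; is_query=True means z is a hyperplane normal."""
    p = dot(U, vec_outer(z))
    return sgn(-p if is_query else p)


def eh_collision_rate(x, w, trials, rng):
    """Fraction of random draws of U under which w and x receive the same bit."""
    d = len(x)
    hits = 0
    for _ in range(trials):
        U = [rng.gauss(0.0, 1.0) for _ in range(d * d)]
        hits += eh_hash(w, U, True) == eh_hash(x, U, False)
    return hits / trials


def eh_collision_prob(alpha):
    """Closed form arccos(sin^2(alpha)) / pi."""
    return math.acos(math.sin(alpha) ** 2) / math.pi
```

The closed form can be read off from the embedding: the inner product of $V(xx^\top)$ and $V(ww^\top)$ equals $(x^\top w)^2$, so the two embedded vectors meet at an angle whose cosine is $\cos^2\theta_{x,w} = \sin^2\alpha_{x,w}$.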

More recently, Liu et al. [?] designed a randomized function family with a bilinear form, Bilinear-Hyperplane Hash (BH-Hash), as:

$$\mathcal{B} = \left\{ h^B(z) = \operatorname{sgn}(u^\top z z^\top v),\ \text{i.i.d. } u, v \sim \mathcal{N}(0, I_{d \times d}) \right\}. \tag{38}$$

As a core finding, Liu et al. proved in [?] that the probability of collision for a hyperplane query $\mathcal{P}_w$ and a database point $x$ under $h^B$ is

$$\Pr[h^B(\mathcal{P}_w) = h^B(x)] = \frac{1}{2} - \frac{2\alpha_{x,w}^2}{\pi^2}. \tag{39}$$

Specifically, $h^B(\mathcal{P}_w)$ is prescribed to be $-h^B(w)$. Eq. (39) endows $h^B$ with the angle-sensitive hashing property. It is important to note that the collision probability given by the BH hash function $h^B$ is always twice the collision probability of the AH hash function $h^A$, and also greater than the collision probability of the EH hash function $h^E$. As illustrated in Figure 14, for any fixed $r$, BH-Hash accomplishes the highest probability of collision, which indicates that BH-Hash has a better angle-sensitive property.
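
The comparison among the three schemes can be reproduced from their closed-form collision probabilities. The sketch below also includes a direct implementation of the bilinear hash bit with the query sign flipped, plus an empirical collision rate; the function names are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sgn(t):
    return 1 if t >= 0 else -1


def bh_hash(z, u, v, is_query):
    """BH-Hash bit sgn(u^T z z^T v); negated when z is a hyperplane normal."""
    bit = sgn(dot(u, z) * dot(v, z))
    return -bit if is_query else bit


def bh_collision_rate(x, w, trials, rng):
    """Fraction of random (u, v) draws under which the query and x collide."""
    d = len(x)
    hits = 0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        hits += bh_hash(w, u, v, True) == bh_hash(x, u, v, False)
    return hits / trials


def p_ah(alpha):
    return 0.25 - alpha ** 2 / math.pi ** 2


def p_eh(alpha):
    return math.acos(math.sin(alpha) ** 2) / math.pi


def p_bh(alpha):
    return 0.5 - 2.0 * alpha ** 2 / math.pi ** 2
```

Evaluating the three functions on a grid of angles in $(0, \pi/2)$ reproduces the ordering stated above: BH-Hash is highest, EH-Hash is in between, and AH-Hash is exactly half of BH-Hash.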

Fig. 14. Comparison of the collision probabilities of the three randomized hyperplane hashing schemes using $p_1$ (probability of collision) vs. $r$ (squared point-to-hyperplane angle).

In terms of the formulation, the bilinear hash function $h^B$ is correlated with, yet different from, the linear hash functions $h^A$ and $h^E$. (1) $h^B$ produces a single hash bit which is the product of the two hash bits produced by $h^A$. (2) $h^B$ may be a rank-one special case of $h^E$ in algebra if we write $u^\top z z^\top v = \operatorname{tr}(z z^\top v u^\top)$ and $U^\top V(z z^\top) = \operatorname{tr}(z z^\top U)$. (3) $h^B$ appears in a universal form, while both $h^A$ and $h^E$ treat a query and a database item in a distinct manner. The computation time of $h^B$ is $O(2d)$, which is the same as that of $h^A$ and one order of magnitude faster than the $O(2d^2)$ of $h^E$. Liu et al. further improved the performance of $h^B$ through learning the bilinear projection directions $u, v$ in $h^B$ from the data. Gong et al. extended the bilinear formulation to the conventional point-to-point hashing scheme through designing compact binary codes for high-dimensional visual descriptors [118].

B. Subspace Hashing 

Beyond the aforementioned conventional hashing, which tackles searching in a database of vectors, subspace hashing [119], which has been rarely explored in the literature, attempts to efficiently search through a large database of subspaces. Subspace representation is very common in many computer vision, pattern recognition, and statistical learning problems, such as subspace representations of image patches, image sets, video clips, etc. For example, face images of the same subject with fixed poses but different illuminations are often assumed to reside near linear subspaces. A common use scenario is to use a single face image to find the subspace (and the corresponding subject ID) closest to the query image [120]. Given a query in the form of a vector or subspace, searching for a nearest subspace in a subspace database is frequently encountered in a variety of practical applications including example-based image synthesis, scene classification, speaker recognition, face recognition, and motion-based action recognition [120].

However, hashing and searching for subspaces are both different from the schemes used in traditional vector hashing and the latest hyperplane hashing. [119] presented a general framework for the problem of Approximate Nearest Subspace (ANS) search, which uniformly deals with the cases where the query is a vector or subspace, the query and database elements are subspaces of fixed dimension, the query and database elements are subspaces of different dimensions, and the database elements are subspaces of varying dimension. The critical technique exploited by [119] has two steps: 1) a simple mapping that maps both query and database elements to "points" in a new vector space, and 2) approximate nearest neighbor search using conventional vector hashing algorithms in the new space. Consequently, the main contribution of [119] is reducing the difficult subspace hashing problem to a regular vector hashing task. [119] used LSH for the vector hashing task. While simple, the hashing technique (mapping + LSH) of [119] may suffer from the high dimensionality of the constructed new vector space.

More recently, [120] exclusively addressed the point-to-subspace query, where the query is a vector and the database items are subspaces of arbitrary dimension. [120] proposed a rigorously faster hashing technique than that of [119]. Its hash function can hash $D$-dimensional vectors ($D$ is the ambient dimension of the query) or $D \times r$-dimensional subspaces ($r$ is arbitrary) in linear time complexity $O(D)$, which is computationally more efficient than the hash functions devised in [119]. [120] further proved the search time under the $O(D)$ hashes to be sublinear in the database size.

Based on the finding of [120], we would like to achieve faster hashing for the subspace-to-subspace query by crafting novel hash functions that handle subspaces of varying dimension. Both theoretical and practical explorations in this direction will be beneficial to the hashing area.

C. MultiModality Hashing 

Note that the majority of the hash learning methods are designed for constructing the Hamming embedding for a single modality or representation. Some recent advanced methods are proposed to design hash functions for more complex settings, such as data represented by multimodal features or data formed in a heterogeneous way [?]. Such hashing methods are closely related to applications in social networks, where multimodality and heterogeneity are often observed. Below we survey several representative methods that have been proposed recently.

Realizing that data items like webpages can be described from multiple information sources, composite hashing was recently proposed to design hashing schemes using several information sources [?]. Besides the intuitive way of concatenating multiple features to derive hash functions, the authors also presented an iterative weighting scheme and formulated a convex combination of multiple features. The objective is to ensure the consistency between the semantic similarity and the Hamming similarity of the data. Finally, a joint optimization strategy is employed to learn the importance of each individual type of feature and the hash functions. Co-regularized hashing was proposed to investigate hash learning across multimodal data in a supervised setting, where similar and dissimilar pairs of intra-modality points are given as supervision information [123]. One such typical setting is to index images and text jointly to preserve the semantic relations between image and text. The authors formulate their objective as a boosted co-regularization framework with the cost component as a weighted sum of the intra-modality and inter-modality loss. The learning process of the hash functions is performed via a boosting procedure so that the bias introduced by previous hash functions can be sequentially minimized. Dual-view hashing attempts to derive a hidden common Hamming embedding of data from two views, while maintaining the predictability of the binary codes [124]. A probabilistic model called multimodal latent binary embedding was recently presented to derive binary latent factors in a common Hamming space for indexing multimodal data [125]. Other closely related hashing methods include the design of multiple feature hashing for near-duplicate detection [126], submodular hashing for video indexing [127], and Probabilistic Attributed Hashing for integrating low-level features and semantic attributes [128].

D. Applications with Hashing 

Indexing massive multimedia data, such as images and video, is a natural application for learning based hashing. In particular, due to the well-known semantic gap, supervised and semi-supervised hashing methods have been extensively studied for image search and retrieval [55][?][11][129][130][131] and mobile product search [132]. Other closely related computer vision applications include image patch matching [?], image classification [?], face recognition [134][135], pose estimation [?], object tracking [136], and duplicate detection [126][137][?][51][138]. In addition, this emerging hash learning framework can be exploited for some general machine learning and data mining tasks, including cross-modality data fusion [139], large scale optimization [140], large scale classification and regression [?], collaborative filtering [?], and recommendation [143]. For indexing video sequences, a straightforward method is to independently compute binary codes for each key frame and use a set of hash codes to represent the video index. More recently, Ye et al. proposed a structure learning framework to derive a video hashing technique that incorporates both temporal and spatial structure information [144]. In addition, advanced hashing methods have also been developed for document search and retrieval. For instance, Wang et al. proposed to leverage both tag information and semantic topic modeling to achieve more accurate hash codes [145]. Li et al. designed a two-stage unsupervised hashing framework for fast document retrieval [146].

Hashing techniques have also been applied to the active learning framework to cope with big data applications. Without performing an exhaustive test on all the data points, hyperplane hashing can significantly speed up the interactive training sample selection procedure [114][117][116]. In addition, a two-stage hashing scheme was developed to achieve fast query pair selection for large scale active learning to rank [147].

VII. Open Issues and Future Directions

Despite the tremendous progress in developing a large array of hashing techniques, several major issues remain open. First, unlike the locality sensitive hashing family, most of the learning based hashing techniques lack theoretical guarantees on the quality of the returned neighbors. Although several recent techniques have presented theoretical analysis of the collision probability, they are mostly based on randomized hash functions [116][117][19]. Hence, it is highly desirable to further investigate such theoretical properties. Second, compact hash codes have been mostly studied for large scale retrieval problems. Due to their compact form, the hash codes also have great potential in many other large scale data modeling tasks such as efficient nonlinear kernel SVM classifiers [148] and rapid kernel approximation [149]. A bigger question is: instead of using the original data, can one directly use compact codes to do generic unsupervised or supervised learning without affecting the accuracy? To achieve this, theoretically sound practical methods need to be devised. This will make efficient large-scale learning possible with limited resources, for instance on mobile devices. Third, most of the current hashing techniques are designed for given feature representations that tend to suffer from the semantic gap. One of the possible future directions is to integrate representation learning with binary code learning using advanced learning schemes such as deep neural networks. Finally, since heterogeneity has been an important characteristic of big data applications, one of the future trends will be to design efficient hashing approaches that can leverage heterogeneous features and multi-modal data to improve the overall indexing quality. Along those lines, developing new hashing techniques for composite distance measures, i.e., those based on combinations of different distances acting on different types of features, will be of great interest.